Steadfast Networks Outage




Posted by KarlZimmer, 09-12-2010, 02:12 PM
Letter from our President/CEO Karl Zimmerman:
We fully understand the severity of this situation and that it negatively affects your business, because this is our business: providing reliable connectivity and data center services. Our own business is also damaged greatly by these types of events; we fully understand that, and we feel that our stellar uptime performance up to this point is a testament to it.

In this case, I am gravely sorry that we have let you down and have not lived up to our promises. What I can do is assure you that we are working as quickly and efficiently as possible on this matter. At the beginning of the project, we had ensured that all of our in-house engineering staff were on-site for the maintenance, along with 3rd-party network engineers for support and additional supervision of the project. We thought we had everything prepared and had spent weeks on configuration and testing, but it appears we were wrong. I don't need to tell you that things did not go smoothly. Overall, we faced many instances of undocumented differences in the handling of standardized protocols. During this time, we worked extensively with both Cisco and Brocade engineers. At this point, it is too early to tell whether Cisco or Brocade is at fault for those issues, but in the end, we understand that the network's reliability is fully our responsibility, and thus we are to blame.

Once this maintenance is finally completed, we will have significantly improved capabilities, performance, and expandability. With the gear we'll have in place, another maintenance like this should not be required for a decade, if not more. You have nothing to fear about this being a continuous issue or a recurring event. It was a one-time major network infrastructure improvement.

I am asking you to please stick with us through these times. We have provided robust and friendly service up to this point; don't let this one incident, even though it is a huge one, destroy the quality business relationship we have together. If you help us through this time with your understanding, we can assure you we are still committed to providing steadfast, reliable service, and this will pay long-term dividends.
- Karl Zimmerman

I agree, this outage is unacceptable. I am posting this here myself so we can accept our blame and take our shots. We deserve it for this one. You cannot understand how bad I feel about this and how sorry I am for the customers it has hurt.

Posted by Mekhu, 09-12-2010, 02:25 PM
Karl,

I'm not here to whine or complain. Your server has been AMAZING over the years. We're more curious about when we can expect a permanent fix for this.

Posted by David, 09-12-2010, 03:03 PM
Alright, enough of the ****ing circus act.
Facility is 100% offline again.

Posted by PersonalJ, 09-12-2010, 03:09 PM
I thought the maintenance was only related to FutureHosting; I was not aware that all of steadfast.net had lost connectivity. It's a bit strange when my services at FDC have better uptime than my VPS at Steadfast...

Posted by AHDOnline, 09-12-2010, 03:10 PM
I'm sorry, but this is the most stupid line ever:
"Once this maintenance is finally completed, we will have significantly improved capabilities, performance, and expandability. With the gear we'll have in place, another maintenance like this should not be required for a decade, if not more. You have nothing to fear about this being a continuous issue or a recurring event. It was a one-time major network infrastructure improvement."

Don't worry, you won't have a 15-hour downtime again, so that makes this one alright...

Your updates need work; they are horrible. And it took YOU 12 hours to say anything...

Posted by VINAX, 09-12-2010, 03:10 PM
The network is down again. What's going on?

Edit: Looks like it's back up now! Hope everything is restored ASAP.

Posted by zenex5ive, 09-12-2010, 03:22 PM
Did anyone get their maintenance notice?

Posted by AHDOnline, 09-12-2010, 03:24 PM
It was said to have gone out on August 23rd, I think.

Posted by Mike343, 09-12-2010, 03:32 PM
So it seems the network went down a while ago; it's coming up on 9 hours now, at least for me?

Posted by RodrigoBR, 09-12-2010, 03:34 PM
My two servers have been down for hours, almost all day now (with only brief moments of the servers working).

I was aware of the maintenance, but now I'm having serious problems because of this long delay.

Too much downtime = unhappy customers = less money for me

I know that Steadfast has been doing a good job over the last few months, but this is a big problem now.

Waiting for a final resolution...

Best Regards,
Rodrigo

Posted by AHDOnline, 09-12-2010, 03:35 PM
you're lucky if you are only noticing 9 hours..

Posted by drmorley, 09-12-2010, 03:54 PM
Quote:
Originally Posted by AHDOnline
you're lucky if you are only noticing 9 hours..
Here's my traffic monitor--looks like it's been bouncing since 6am CST.

Posted by SuperJ, 09-12-2010, 03:58 PM
Does anyone know what's going on with FutureHosting in Chicago? My server has been failing for at least six hours. I've tried to contact admins but there's no response from them. FTP seems to work but is very sluggish...

Posted by AHDOnline, 09-12-2010, 03:58 PM
Well, I guess it's how you look at it. My first alert of downtime was at 2:26, with a FEW very small uptime alerts. It looks like the majority of the servers did go offline around 5am EST.

Posted by David, 09-12-2010, 03:59 PM
They're utilizing steadfast.net, which has been on and off since midnight; see: http://www.webhostingtalk.com/showthread.php?t=980579

Posted by futurehosting, 09-12-2010, 03:59 PM
There aren't any open tickets - if you had submitted one, you would have received a response within a couple of minutes. But yes, take a look at this thread for information: http://www.webhostingtalk.com/showth...wpost&t=980579

Posted by robbyhicks, 09-12-2010, 04:12 PM
What's the deal with SLA Credit on this?

Posted by SuperJ, 09-12-2010, 04:17 PM
No, I haven't gotten any response. This is a copy of my email:

Server not responsive.
Sunday, September 12, 2010 2:31 PM
From: "Jacek Nabywaniec" <techie_jn@yahoo.com>
To: "Future Hosting" <support@futurehosting.com>

Hello,
My sites do not respond to http calls. FTP connected but sluggish.
www.hockey.pl
www.polville.com
www.photojack.info


It's been an hour and a half since the email, but I have a guy working for me who says the server has been down since about 10:00 EST. I understand that it was more unreachable than down.

Anyway, you guys run your service well overall. I'd expect you to put a message on futurehosting.com where I could read a few lines about when the service should be back up; things would be a lot simpler.

Posted by bmhrules, 09-12-2010, 04:18 PM
Quote:
Originally Posted by unrealized
What's the deal with SLA Credit on this?
This Service Level Agreement does not cover outages due to scheduled or emergency network and/or facility maintenance, which will be broadcast to all customers in advance via the web page at https://support.steadfast.net/?_m=news&_a=view, and will not exceed 60 minutes per month.

So I guess that since it was down for more than one hour, we get credit for it?

Posted by futurehosting, 09-12-2010, 04:21 PM
If you log in to the customer portal, you will actually see the announcement listed in bright blue.

We don't have any tickets under your account - tickets can't be opened via e-mail. If you go to my.futurehosting.com you will be able to log in and submit a ticket. I'll open a ticket under your account so you can respond to it via e-mail.

Posted by GCM, 09-12-2010, 04:21 PM
Quote:
Originally Posted by SuperJ
No, I haven't gotten any response. This is a copy of my email:

Server not responsive.
Sunday, September 12, 2010 2:31 PM
From: "Jacek Nabywaniec" <techie_jn@yahoo.com>
To: "Future Hosting" <support@futurehosting.com>

Hello,
My sites do not respond to http calls. FTP connected but sluggish.
www.hockey.pl
www.polville.com
www.photojack.info


It's been an hour and a half since the email, but I have a guy working for me who says the server has been down since about 10:00 EST. I understand that it was more unreachable than down.

Anyway, you guys run your service well overall. I'd expect you to put a message on futurehosting.com where I could read a few lines about when the service should be back up; things would be a lot simpler.
Your sites seem to be loading now. Best of luck to FutureHosting and Steadfast in recovering.

Posted by David, 09-12-2010, 04:28 PM
Quote:
Originally Posted by GCM
Your sites seem to be loading now. Best of luck to FutureHosting and Steadfast in recovering.
None of them load here, or for about 85% of the planet. They're working in extremely limited areas.

Posted by IGXHost, 09-12-2010, 04:29 PM
Ah, that explains why our VPS with them was going up and down.

Posted by NetDepot - Terrence, 09-12-2010, 04:30 PM
The whole network is still down? How long was it down for?

Posted by SuperJ, 09-12-2010, 04:34 PM
Nothing works for me. Not even FTP. I'm in Toronto. My guy in Europe says the service is blazing fast.

Posted by GCM, 09-12-2010, 04:36 PM
Quote:
Originally Posted by David
None of them load here, or for about 85% of the planet. They're working in extremely limited areas.
Ah, I was only testing from our San Francisco network and Detroit VPN. (Pretty fast might I add)

Posted by AHDOnline, 09-12-2010, 04:36 PM
Man, see how many times you could have implemented a back-out plan? Assuming there was one...

Posted by Deathspawner, 09-12-2010, 04:39 PM
This downtime hurts, big time. I'll be the first to say that my experience with Steadfast up to this point has been nothing short of incredible, but this kind of downtime is hard to bear. Worse still, I didn't receive an e-mail regarding this maintenance at all. If I'd at least had a heads-up, I wouldn't have been scurrying all over the place wondering what's going on.

The official site should have been kept updated, not just this forum thread. Not everyone immediately heads here for updates.

Posted by AHDOnline, 09-12-2010, 04:43 PM
They did sorta, kinda pretend to have an update page. If you could still get to the site at support.steadfast.net, there were a few updates, along the lines of "not working, no ETA", then three hours later: "not working, no ETA".

Posted by HiDef-Laws, 09-12-2010, 05:00 PM
I don't delete any email, and I don't filter any out either. I just did a search for all Steadfast emails, and no notice of this was sent to me. Some downtime is understandable for the infrastructure upgrades taking place, but this has gotten a bit ridiculous. Hopefully service is restored within the next couple of hours.

Posted by cshenderson, 09-12-2010, 05:07 PM
I've been extremely happy with Steadfast... until now.
It's not necessarily the outage; as we all know, crap happens, and sometimes the backout plan leads to longer delays and huge overhead costs.
What is disappointing is that Steadfast, like all the other companies we critically rely on (like Covad or even Comcast Business), just will not set up a warning system. In the age of email, IM, text... a dozen and a half different ways of contacting people, nobody will do it; even power companies rely on people calling in to inform them something went terribly wrong during maintenance.
Learning of the outage from one of my clients calling me is just plain BAD.
Coming into my small business only to learn that our ISP did maintenance an hour before and is still down is BAD.
Why oh why can't we get decent warnings from companies (not just Steadfast) instead of having to constantly troll their boards for maintenance schedules?

Posted by HD-Sam, 09-12-2010, 05:10 PM
16 hours and counting... We cannot access our servers at Steadfast. Their site is down, their support site is down, and their phone lines are down (at least for me; at some ISPs their site still seems to be up). The premium we are paying for this is outrageous.

Posted by VINAX, 09-12-2010, 05:16 PM
Quote:
Originally Posted by HiDef-Laws
I don't delete any email, and I don't filter any out either. I just did a search for all Steadfast emails, and no notice of this was sent to me. Some downtime is understandable for the infrastructure upgrades taking place, but this has gotten a bit ridiculous. Hopefully service is restored within the next couple of hours.
Did you subscribe to the news at their support system?
https://support.steadfast.net/index.php?_m=news&_a=view

If not, enter your email and submit it.
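If you'd rather not keep checking that page by hand, the same news page also has an "XML" link for an RSS feed, so you can script the check. A rough sketch in Python (the feed URL below is an assumption; use whatever address the "XML" link actually points to):

Code:
# Poll the announcements feed and print anything new.
# FEED_URL is assumed; substitute the address behind the "XML" link.
import time
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://support.steadfast.net/index.php?_m=news&_a=view&format=xml"  # assumed

seen = set()
while True:
    with urllib.request.urlopen(FEED_URL) as resp:
        tree = ET.parse(resp)
    for item in tree.iter("item"):
        title = item.findtext("title", default="(no title)")
        if title not in seen:
            seen.add(title)
            print("New announcement:", title)
    time.sleep(300)  # re-check every five minutes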

Posted by SuperJ, 09-12-2010, 05:17 PM
Any idea what all this means, guys? I don't mean to be a pest, but Sunday is prime time for me. The hockey games finished hours ago and there are no stats, because I can't access anything from Toronto while my audience in Poland can. I'm looking pretty bad right now. Any idea when Toronto traffic will make it through?

Posted by VINAX, 09-12-2010, 05:20 PM
Quote:
Originally Posted by HD-Sam
16 hours and counting... We cannot access our servers at Steadfast. Their site is down, their support site is down, and their phone lines are down (at least for me; at some ISPs their site still seems to be up). The premium we are paying for this is outrageous.
Their site and support system didn't go down because they're hosted on a different network.

Posted by GCM, 09-12-2010, 05:21 PM
Quote:
Originally Posted by SuperJ
Any idea what all this means, guys? I don't mean to be a pest, but Sunday is prime time for me. The hockey games finished hours ago and there are no stats, because I can't access anything from Toronto while my audience in Poland can. I'm looking pretty bad right now. Any idea when Toronto traffic will make it through?
No idea; what you can do is use a proxy in the meantime. http://www.hidemyass.com/proxy-list/

Posted by HiDef-Laws, 09-12-2010, 05:23 PM
My service is now back online. YEAH!

Posted by Deathspawner, 09-12-2010, 05:24 PM
My site is also back online. Whew.

Posted by HiDef-Laws, 09-12-2010, 05:25 PM
Quote:
Originally Posted by VINAX
Their site and support system didn't go down because they're hosted on a different network.
It was up for me earlier today but has been down for about 2 hours now (both their main site and their support site). Neither steadfast.net nor support.steadfast.net resolves in DNS or responds to ping.
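For anyone who wants to reproduce the check, this is roughly all I'm doing -- a DNS lookup plus a TCP connection to port 80 (a sketch; TCP instead of a real ICMP ping, since raw ICMP needs root):

Code:
# Does the name resolve, and does anything answer on port 80?
import socket

for host in ("steadfast.net", "support.steadfast.net"):
    try:
        addr = socket.gethostbyname(host)  # DNS lookup
    except socket.gaierror as exc:
        print(host, "- DNS lookup failed:", exc)
        continue
    try:
        with socket.create_connection((addr, 80), timeout=5):
            print(host, "(" + addr + ")", "- answering on port 80")
    except OSError as exc:
        print(host, "(" + addr + ")", "- no answer on port 80:", exc)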

Posted by Mike343, 09-12-2010, 05:29 PM
Still facing downtime here.

Posted by VINAX, 09-12-2010, 05:29 PM
Quote:
Originally Posted by HiDef-Laws
It was up for me earlier today but has been down for about 2 hours now (both their main site and their support site). Neither steadfast.net nor support.steadfast.net resolves in DNS or responds to ping.
It's weird. Their site is always up for us.

Posted by HD-Sam, 09-12-2010, 05:30 PM
Quote:
Originally Posted by VINAX
Their site and support system didn't go down because they're hosted on a different network.
Not true. I thought it was on a different network as well. But... I tried it on my own ISP, my Sprint phone, and an AT&T phone, and as the ultimate test I verified it with Alertra.com's spot check. It was down on all of them, and Alertra.com listed it as down at 6 of their 10 worldwide locations.

Posted by cshenderson, 09-12-2010, 05:30 PM
I thought I had signed up for the Steadfast support alerts. I guess my ignorance of this upgrade is my fault.
I still maintain that most critical ISPs don't seem to have warning systems in place, but in this case I seem to be wrong.

To jump into the other conversations: their support site was down for me across 3 different ISPs in the Chicagoland area.

Posted by Mekhu, 09-12-2010, 05:30 PM
We run VOIP servers on Steadfast and haven't been blocked out totally. It's just been a matter of up and down like a yo-yo all day.

My last Pingdom down report came in at 5:13pm EST (15 minutes ago). Let's hope that was the last one... receiving 200+ of those in an evening just sucks.
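(For context, a "down report" is just a failed HTTP check against your site. A minimal sketch of the kind of check a monitor runs -- not Pingdom's actual code, and the URL is a placeholder:)

Code:
# Minimal uptime check: one GET per minute, report UP/DOWN.
import time
import urllib.request

URL = "http://www.example.com/"  # placeholder for the monitored site

while True:
    stamp = time.strftime("%H:%M:%S")
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            print(stamp, "UP" if resp.status == 200 else "HTTP %d" % resp.status)
    except Exception as exc:
        print(stamp, "DOWN:", exc)
    time.sleep(60)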

The only thing I want an answer on is notification. Why in the $%#^ were we not notified of this maintenance window? That, to me, screams SLA credit. If I had been aware, I'd have no issue with you quoting the TOS, but with no notice comes some responsibility. You made EVERY business that relies on you look like amateurs today with this stunt.

Posted by chopsmidi, 09-12-2010, 05:31 PM
Been down for me now for half an hour. It was stable before then, for about an hour and a half.

Posted by Mekhu, 09-12-2010, 05:31 PM
Quote:
Originally Posted by cshenderson
I thought I had signed up for the Steadfast support alerts. I guess my ignorance of this upgrade is my fault.
I still maintain that most critical ISPs don't seem to have warning systems in place, but in this case I seem to be wrong.
Sign up for what? Since when must you sign up with a provider to receive important (service-impacting) notices? That's ridiculous.

Posted by Deathspawner, 09-12-2010, 05:33 PM
Quote:
Originally Posted by Mekhu
Sign up for what? Since when must you sign up with a provider to receive important (service-impacting) notices? That's ridiculous.
+1. I was thinking the same thing. It should be a given.

Edit: And I spoke too soon... our site is down again. Ugh.

Posted by crazylane, 09-12-2010, 05:37 PM
oops, down again.

Posted by HD-Sam, 09-12-2010, 05:39 PM
Quote:
Originally Posted by cshenderson
I thought I had signed up for the Steadfast support alerts. I guess my ignorance of this upgrade is my fault.
I still maintain that most critical ISPs don't seem to have warning systems in place, but in this case I seem to be wrong.

To jump into the other conversations: their support site was down for me across 3 different ISPs in the Chicagoland area.
They sent out an email on August 24th about the maintenance, but never once did I think it would be an outage like this.

Posted by RodrigoBR, 09-12-2010, 05:43 PM
Still down for me.

Posted by Mekhu, 09-12-2010, 05:44 PM
Confirmed down since 5:34pm EST. I have all of our clients updated. Time to go make some sweet BBQ and forget about this for a bit.

Posted by HiDef-Laws, 09-12-2010, 05:54 PM
Was up shortly and down again. :/

Up again. Guess this is to be expected for 12-24 hours.

Posted by ManagerJosh, 09-12-2010, 06:03 PM
All our latest updates on the network maintenance can be found at https://support.steadfast.net/index....ews&newsid=285

I'm sorry for any problems and disruptions we are causing to your business.

Posted by Mekhu, 09-12-2010, 06:05 PM
Quote:
Originally Posted by ManagerJosh
All our latest updates on the network maintenance can be found at https://support.steadfast.net/index....ews&newsid=285

I'm sorry for any problems and disruptions we are causing to your business.
sigh... you just had to post it, didn't you

"Too many connections".

If anything new shows up on that site, someone be sure to copy/paste it here for us all!

Posted by HD-Sam, 09-12-2010, 06:09 PM
It says too many connections now... but I have my last refresh from about 30 minutes ago:

2:29 PM: Restoration is taking longer than anticipated. ETA revised to 4:30PM CDT
2:29 PM: ETA is about 30m-1h. We'll fully update when things are back online, but it's expected we'll see a full restoration of connectivity before 3:30PM CDT
11:32 AM: Core routers are both online. We're bringing things back VLAN by VLAN; as soon as you're back online we'll email replies to existing tickets to confirm that your connectivity has been restored.
7:38 AM: We are continuing to work on outstanding issues caused by the router replacements and currently have no ETA on resolution.
6:06 AM: The problems being caused by the compatibility issue seem to have been solved by removing core4 and replacing it with the Brocade. We are still diagnosing some of the remaining issues and hope to have them resolved shortly.
5:15 AM: The issue was traced to a compatibility issue that prevents the Brocade and Cisco from working together and we are going ahead with completing the replacement of core4 at which point we will resolve all outstanding issues. We are expanding the maintenance period by one hour to deal with this problem.
4:58 AM: We are continuing to work on solving the problems caused by the new router with Brocade. We are attempting several possible solutions at this time and will report when we have additional information.
4:00 AM: We are currently working with Brocade support to attempt to resolve some of the issues that have resulted from the upgrade. We will provide an update as soon as we can.
3:23 AM: We are currently diagnosing known issues with packet loss that are affecting most customers. We will not proceed with the core4 replacement until this problem is resolved.
2:30 AM: core3 has been replaced and we are currently working on bringing the new Brocade router into normal operation status and solving a few complications from the change before we begin the replacement of core4. We will provide an update when this step is complete.
12:53 AM: We are now proceeding with the removal of core3 and replacement with its Brocade equivalent. We will provide an update if there are any issues with this process or when the replacement is completed.
12:25 AM: We are currently verifying configuration settings on the routers to reduce the risk of errors that may cause transition issues. We will provide an update when we are ready to begin the physical migration.
11:39 PM: The maintenance is on schedule to begin at 12:00 AM.

Posted by ManagerJosh, 09-12-2010, 06:10 PM
Quote:
Originally Posted by Mekhu
sigh... you just had to post it, didn't you

"Too many connections".

If anything new shows up on that site, someone be sure to copy/paste it here for us all!
I will do my best to post updates as fast as I receive them.

Posted by Mekhu, 09-12-2010, 06:10 PM
Thanks Sam. I was hoping to see another update after the 2:29 posting but I'll remain patient. Thank GOD we have some amazing clients.

Posted by MattE, 09-12-2010, 06:11 PM
We noticed major packet loss starting at around 3AM, lasting until about 7AM when everything completely died, and we've been offline since (it is now 5PM). We have had exceptional service with Steadfast up to this point, though, and we are still very happy overall; hopefully there is some form of credit offered for this downtime.

Posted by Mike343, 09-12-2010, 06:13 PM
It seems that ETA was too early, as 4:30 CDT has come and gone :/

Posted by TopekaHost, 09-12-2010, 06:14 PM
We just started with Geekstorage.com, and they use Steadfast Networks for their data center. This has been awful. We just started in the business, and as the owner I called all my clients from our offline database, since almost all our data was on the server. At one point I decided to build an Access database on our own computers so we can call clients in the event our HostBill is offline.

So if anyone hears or sees any status updates, by all means please post them. Thanks again, sincerely, from my heart to yours.

Posted by Deathspawner, 09-12-2010, 06:18 PM
The only new update there is this:

5:10 PM: Key components are now working, services should be coming up for most customers shortly.

Our site keeps going up and down sporadically though.

Posted by TopekaHost, 09-12-2010, 06:26 PM
Under no circumstances did I expect to be down for almost 18 hours now. My site went offline at 12AM and has been off since then; my entire VPS is offline thanks to them.

Posted by KarlZimmer, 09-12-2010, 06:30 PM
Wow, sorry for the delay. We've been actively working on this and even worked out a plan to revert back to the old setup, but when we started getting things together, both the chassis and a line card refused to work again... NOTHING has been going right today; it is just crazy.

Well, things ARE starting to stabilize. There will still be odds and ends to tie up, and likely still some BGP convergence type stuff, but it's stabilizing for the most part.

Posted by HD-Sam, 09-12-2010, 06:34 PM
Quote:
Originally Posted by KarlZimmer
Wow, sorry for the delay. We've been actively working on this and even worked out a plan to revert back to the old setup, but when we started getting things together, both the chassis and a line card refused to work again... NOTHING has been going right today; it is just crazy.

Well, things ARE starting to stabilize. There will still be odds and ends to tie up, and likely still some BGP convergence type stuff, but it's stabilizing for the most part.
Our servers are starting to reappear. Hope it stays. Thanks for the update.

Posted by ManagerJosh, 09-12-2010, 06:34 PM
Quote:
Originally Posted by TopekaHost
Under no circumstances did I expect to be down for almost 18 hours now. My site went offline at 12AM and has been off since then; my entire VPS is offline thanks to them.
I'm sorry you've had a horrible first impression with us.

I know how important our services are to the success of your organization, and we will do whatever it takes to restore service.

Posted by TopekaHost, 09-12-2010, 06:35 PM
So Karl,

When will this outage finally be over? When will we have access to our systems again?

Posted by KarlZimmer, 09-12-2010, 06:36 PM
Quote:
Originally Posted by Steven
Karl,
Did your network admins have any prior experience with Brocade as a core routing platform?

Did they just start learning it 3 weeks ago?

When you said you did testing, what kind of testing?

Did you actually set up some BGP sessions and put some load on them?

Is it true that you did not have on-site network admins? If so, why did it take so long to get someone on site?

What measures are you going to take to prevent outages in the future? Your network administration team's Brocade experience is obviously lacking.
Yes, our engineers have experience with Brocade, though not extensive experience; the 3rd-party engineer we had on-site, however, has extensive Brocade experience.

We basically set up the boxes as part of a test environment with Cisco gear, though not the exact same gear, and did very similar configurations, etc. We did turn up BGP, and BGP was not the primary issue here.

We had our own network engineer on site as well as the 3rd-party network engineer; they were the ones performing the maintenance. There was absolutely no delay anywhere in bringing someone on-site; everyone was already on-site.

Right now we're working out additional plans/changes regarding our overall network engineering and network configuration arrangements. I've been busy actually dealing with resolving this issue, so I don't have those plans finalized yet at this point.

Posted by David, 09-12-2010, 06:40 PM
Quote:
Originally Posted by KarlZimmer
We had our own network engineer on site as well as the 3rd-party network engineer; they were the ones performing the maintenance. There was absolutely no delay anywhere in bringing someone on-site; everyone was already on-site.
Again, I find that extremely weird considering one of your own employees said otherwise just a few hours ago via IRC.

14:01 <@ub3r> okay, engineers are going to 350 right now.
14:02 <@ub3r> they were working remotely, going onsite now
14:02 <@David> ...
14:02 <@David> you. are.!@#$!@#$. kidding. me?
14:02 <@Steven> what the !@#$!
14:02 <@ub3r> What do you know about networking david?
14:02 <@David> nothing.
14:02 <@David> roughly the same as you folks.

Posted by TopekaHost, 09-12-2010, 06:40 PM
I can finally see the Steadfast core router. Still not totally up, but we are making progress.

Posted by KarlZimmer, 09-12-2010, 06:42 PM
Quote:
Originally Posted by Steven
Several people I know did not receive emails.
It states in our welcome email:
Quote:
As many of our customers have expressed that they do not want to receive service notices to the primary contacts within their accounts, we do not notify customers of most routine maintenance and service changes via email. If you would like to be notified when we post an announcement, please visit the following link and enter your email address in the "Subscribe" box on the right side of the page:

https://support.steadfast.net/?_m=news&_a=view

You can also click on the "XML" link in the "Subscribe" box to access an RSS feed of announcements which you can subscribe to in your favorite RSS reader. We also provide links to recent announcements and company blog posts at the bottom of our front page if you prefer to check manually.
Then in the TOS:
Quote:
Scheduled maintenance is announced at https://support.steadfast.net/?_m=news&_a=view
We would love to mail everyone announcements, but we've gotten SO many complaints when we do...

Posted by David, 09-12-2010, 06:43 PM
There's quite a bit more as well, stating otherwise:


14:04 <@ub3r> well anywho, there isn't always a reason to be onsite
14:05 <@ub3r> we do have an out-of-band management system, and our IP-to-Serial boxes are connected to that network.

Posted by TopekaHost, 09-12-2010, 06:44 PM
Hallelujah, we are finally up and running. Thanks, Steadfast, for the connectivity.

Posted by KarlZimmer, 09-12-2010, 06:44 PM
Quote:
Originally Posted by David
Again, I find that extremely weird considering one of your own employees said otherwise just a few hours ago via IRC.

14:01 <@ub3r> okay, engineers are going to 350 right now.
14:02 <@ub3r> they were working remotely, going onsite now
14:02 <@David> ...
14:02 <@David> you. are.!@#$!@#$. kidding. me?
14:02 <@Steven> what the !@#$!
14:02 <@ub3r> What do you know about networking david?
14:02 <@David> nothing.
14:02 <@David> roughly the same as you folks.
That is just plain wrong. I can't explain why he would have said that. We've had engineers on site since 10PM last night, preparing for the maintenance and then performing it. I can show you badge access logs for the building if you really don't believe me...

Posted by Steven, 09-12-2010, 06:45 PM
Quote:
Originally Posted by KarlZimmer
As many of our customers have expressed that they do not want to receive service notices to the primary contacts within their accounts, we do not notify customers of most routine maintenance and service changes via email. If you would like to be notified when we post an announcement, please visit the following link and enter your email address in the "Subscribe" box on the right side of the page:

I honestly do not consider this to be routine maintenance or a simple service change. This is a serious change. Several people I know have had their lives turned upside down by this downtime. As you may or may not know, today is an important day for fantasy football websites.

Posted by David, 09-12-2010, 06:47 PM
Quote:
Originally Posted by KarlZimmer
That is just plain wrong. I can't explain why he would have said that. We've had engineers on site since 10PM last night, preparing for the maintenance and then performing it. I can show you badge access logs for the building if you really don't believe me...
I believe you; I'd have no reason to suspect otherwise -- I was merely going based on what he stated. I assumed Mike was in the know, considering he's a team member. At any rate, I'm sure you're busy -- you've got an awful lot more clients to **** over, so I'll let you get back to work.

Posted by Deathspawner, 09-12-2010, 06:52 PM
Quote:
Originally Posted by KarlZimmer
It states in our welcome email:


Then in the TOS:


We would love to mail everyone announcements, but we've gotten SO many complaints when we do...
It would probably be easier to still have e-mails sent out by default, and just let those who don't want them unsubscribe easily. I'd be willing to bet that a lot more people would rather receive these e-mails than not, given that their paychecks likely depend on their sites.

I find it strange that so many people would complain anyway, given the sheer amount of spam that reaches our inboxes (mine, anyway).

Posted by PersonalJ, 09-12-2010, 06:56 PM
I'm back online; I think I was down for roughly 18 hours. I did not notice the downtime until around 3 AM EST.

Posted by RodrigoBR, 09-12-2010, 07:02 PM
Nothing yet; my servers are still down here...

I have large and important websites/customers on these servers; I'm losing money.

Posted by HiDef-Laws, 09-12-2010, 07:06 PM
Mine has been up and down for the past few hours. The down periods are a bit long for routing re-convergence, but at least it's not totally dead atm.

Posted by KarlZimmer, 09-12-2010, 07:13 PM
Quote:
Originally Posted by HiDef-Laws
Mine has been up and down for the past few hours. The down periods are a bit long for routing re-convergence, but at least it's not totally dead atm.
There are currently off-and-on memory issues with one of the routers, and that is probably what you're seeing. We're working with the vendor on getting the config properly transferred, replaced, etc.

Posted by RodrigoBR, 09-12-2010, 07:14 PM
Quote:
Originally Posted by HiDef-Laws
Mine has been up and down for the past few hours. The down periods are a bit long for routing re-convergence, but at least it's not totally dead atm.
For me there are only brief moments of uptime; it's totally unstable. Most of the time everything is down.

Right now, since my last post, all services are still down.

Best Regards,
Rodrigo

Posted by dgessler, 09-12-2010, 07:14 PM
Quote:
Originally Posted by RodrigoBR
Nothing yet; my servers are still down here...

I have large and important websites/customers on these servers; I'm losing money.
We're running high-traffic e-commerce sites; we lost a ton of money because of this downtime.

Although we were not subscribed to their maintenance e-mail (we didn't really think to do so before), Steadfast should have e-mailed everyone about this major scheduled downtime. Notified or not, 6 hours is still an unacceptable amount of time to be down, and 17+ hours is epically unacceptable. This definitely warrants a change in data centers.

Posted by Mekhu, 09-12-2010, 07:19 PM
Quote:
Originally Posted by KarlZimmer
It states in our welcome email:


Then in the TOS:


We would love to mail everyone announcements, but we've gotten SO many complaints when we do...
That's funny; my welcome email from your company (I signed up 4-5 years ago, I think) says nothing of the sort.

Anyway, I'm not about to argue. I'm about done with your company as of right now and just want to forget this. I can only hope you're not a money-hungry company and make good on some compensation for us all.

I think I pay about 2x more at Steadfast compared to our Dallas, New York, etc. locations, and I've never had an issue with that until now.

I'm still amazed I was sent no information about this. Maybe I'll have to start checking 10+ websites daily for updates from our DCs... yeah, that seems logical

BTW, still ******** network access from our end.

Posted by Steven, 09-12-2010, 07:21 PM
It's horrible. All day I was able to access it for the most part - it's been completely down for me for the last 2 1/2 hours.

Posted by menchibantam, 09-12-2010, 07:22 PM
I have been with Steadfast for many years now, because I like to think that if there are any problems they will be my fault only, because Steadfast is so good when it comes to handling their own problems. All our critical stuff is with them because they are the most reliable people with the nicest staff in the business, as far as I'm concerned. After this 18-hour downtime, I wonder if I have just gotten lucky all this time. I have lost a lot of advertising revenue and potential user upgrades today, and it hurts more because Sunday is almost always our busiest day.

Posted by HD-Sam, 09-12-2010, 07:22 PM
Quote:
Originally Posted by dgessler
We're running high-traffic e-commerce sites; we lost a ton of money because of this downtime.

Although we were not subscribed to their maintenance e-mail (we didn't really think to do so before), Steadfast should have e-mailed everyone about this major scheduled downtime. Notified or not, 6 hours is still an unacceptable amount of time to be down, and 17+ hours is epically unacceptable. This definitely warrants a change in data centers.
Agreed. We've been down since 11:56pm last night, and it is now under 6 hours until the 24-hour mark. Let's hope we don't reach that.

Posted by Mekhu, 09-12-2010, 07:26 PM
Did the VPS machines need to fsck? I'm lost as to why some of us are getting yo-yo connections and others are 100% offline!?

Posted by HD-Sam, 09-12-2010, 07:30 PM
Quote:
Originally Posted by Mekhu
Did the VPS machines need to fsck? I'm lost as to why some of us are getting yo-yo connections and others are 100% offline!?
They shouldn't; this is a network issue. Karl mentioned:

Quote:
Originally Posted by KarlZimmer
There are currently off-and-on memory issues with one of the routers, and that is probably what you're seeing. We're working with the vendor on getting the config properly transferred, replaced, etc.
That and BGP convergence are most likely why we keep going on/offline.

Posted by Mekhu, 09-12-2010, 07:31 PM
Quote:
Originally Posted by HD-Sam
That and BGP convergence are most likely why we keep going on/offline.
Thanks for the reply. I thought fsck was only needed after power loss, so that's good to know.

As for the on/offline, I understand that. I'm just confused why some have NO access at all.

Posted by HD-Sam, 09-12-2010, 07:36 PM
Quote:
Originally Posted by Mekhu
Thanks for the reply. I thought fsck was only needed after power loss, so that's good to know.

As for the on/offline, I understand that. I'm just confused why some have NO access at all.
Ah, they also mentioned they were bringing things up VLAN by VLAN. They may not have gotten to yours yet.

Posted by KarlZimmer, 09-12-2010, 07:38 PM
Quote:
Originally Posted by HD-Sam
They shouldn't; this is a network issue. Karl mentioned:



That and BGP convergence are most likely why we keep going on/offline.
Yep, the router going up and down is causing BGP issues, etc., but there are a number of customers whose VLAN is only on that router. We have people picking up a spare supervisor card to fix that right now.

Posted by KarlZimmer, 09-12-2010, 07:39 PM
If you have NO access please PM me and we'll look into that ASAP.

Posted by Deathspawner, 09-12-2010, 07:40 PM
Quote:
Originally Posted by KarlZimmer
If you have NO access please PM me and we'll look into that ASAP.
Does that mean that we're at the end of the downtime?

Posted by Mike343, 09-12-2010, 07:41 PM
They're down here at 6:40 PM CST.

Posted by HiDef-Laws, 09-12-2010, 07:53 PM
My stuff has been up for the past 30 minutes or so, with only 1 or 2 dropped packets. Things may be stabilizing on whichever network segment I'm currently attached to.

Posted by drmorley, 09-12-2010, 08:09 PM
All my machines are still unreachable.

Posted by Mekhu, 09-12-2010, 08:16 PM
Quote:
Originally Posted by HiDef-Laws
My stuff has been up for the past 30 minutes or so, with only 1 or 2 dropped packets. Things may be stabilizing on whichever network segment I'm currently attached to.
Agreed. While I don't have our streaming service running, at least my Pingdom notifications have stopped.

Posted by KarlZimmer, 09-12-2010, 08:16 PM
Quote:
Originally Posted by drmorley
All my machines are still unreachable.
You're homed on the switch that currently doesn't have a management card; our people are on the way back with the replacement. We weren't expecting hardware issues with two of our management cards, as they had been known to be working beforehand...

You'll be up once that is in, but I'd recommend you contact network operations once this is settled so you can have your VLAN set up with VRRP.

Posted by ManagerJosh, 09-12-2010, 08:21 PM
As of 6:10 PM Central Time: There are off and on memory issues with one of the core routers still causing some issues that we are working with the vendor to resolve, but everything else should be online. If not, please contact our support department.

Posted by jon222, 09-12-2010, 08:36 PM
Karl or someone else from Steadfast, are we going to get any kind of SLA credit for this extended amount of downtime?

Posted by Chrysalis, 09-12-2010, 08:39 PM
Karl, the problem is still ongoing for me.

My server never actually went down for long, but it has been going up and down like a yo-yo all day, causing havoc with the services I host.

I'm a long-term customer and don't plan on leaving; however, please, next time there's an emergency situation, send out an email.

When it's down, I see the routing changing between NTT and nLayer.

Posted by KarlZimmer, 09-12-2010, 08:59 PM
Quote:
Originally Posted by jon222
Karl or someone else from Steadfast, are we going to get any kind of SLA credit for this extended amount of downtime?
SLA terms will be followed; just open a ticket with billing.

Posted by ManagerJosh, 09-12-2010, 09:21 PM
If you are still experiencing issues with your account, please open a support ticket with us and one of our team members will investigate.

Posted by ManagerJosh, 09-12-2010, 09:34 PM
As of 7:20 PM: Core4 is actively being worked on and appears to be the last issue. Customers homed to Core4 only are down, but all other services should be normalized. If your account is still down, please contact our support department.

Posted by KarlZimmer, 09-12-2010, 09:37 PM
Parts are all in for core4 and in place; we're just finishing up the configuration.

Posted by Deathspawner, 09-12-2010, 09:49 PM
I don't know what an SLA is, but are regular customers going to see any restitution for this hassle? I don't run a site that drives sales like many others here, but I still lose money on advertising, not to mention my traffic numbers for almost an entire day. Kind of hard to stomach.

Posted by ManagerJosh, 09-12-2010, 09:57 PM
SLA stands for service level agreement. Please see http://en.m.wikipedia.org/wiki/Servi...edirected=true for a better idea of what an SLA is.


As for compensation for the downtime, please open a support ticket with billing and they will take care of you.

Posted by RodrigoBR, 09-12-2010, 10:01 PM
I was about to post that services were back up, but I still have one server down.

How much time do you need to fix all the issues?!?

I can't accept almost one full day of downtime.

Best Regards,
Rodrigo

Posted by ManagerJosh, 09-12-2010, 10:13 PM
Hi Rodrigo:

Please open a support ticket and one of our team members will investigate the matter in complete detail.

Thank you for your continued patience with us.

Posted by qlites, 09-12-2010, 10:17 PM
Server went down again. No response to my support ticket.

Posted by qlites, 09-12-2010, 10:44 PM
Came back up for a minute and went back down. This is becoming a very bad joke!

Posted by KarlZimmer, 09-12-2010, 10:51 PM
Now resolving some routing loops from the new setup.

Posted by HD-Sam, 09-12-2010, 10:58 PM
Servers are still going up and down, and we can't update our ticket because we can't access the Steadfast support site either.

Posted by pfak, 09-12-2010, 11:00 PM
Quote:
Originally Posted by Mekhu
The only thing I want an answer on is notification. Why in the $%#^ were we not notified of this maintenance window? That, to me, screams SLA credit.
SLA credit? I'd be looking for new hosting, clearly Steadfast does not know how to operate a network.

Posted by Mekhu, 09-12-2010, 11:04 PM
Quote:
Originally Posted by pfak
SLA credit? I'd be looking for new hosting, clearly Steadfast does not know how to operate a network.
Unfortunately, I'm not an *******. Steadfast has given me and my clients years of uninterrupted service without issues. I still trust them. I just want this over and things back to their normal, reliable ways.

But yes, I agree this is more than a small mixup. I'm pretty amazed, and can only hope this hit them hard in the wallet so they learn their lesson.

Posted by KarlZimmer, 09-12-2010, 11:08 PM
Quote:
Originally Posted by pfak
SLA credit? I'd be looking for new hosting, clearly Steadfast does not know how to operate a network.
And you don't know the full situation. This is the only major network outage in 3+ years, by my last count. This was a MAJOR undertaking that we mis-evaluated. Yes, we made a mistake and were not fully prepared to make this migration, but there are MANY extenuating circumstances that have brought this issue to the point it is now at.

Posted by panopticon, 09-12-2010, 11:09 PM
I feel bad that this went wrong for them; until today's network incident, they've provided me with 100% uptime and excellent service.

Posted by Deathspawner, 09-12-2010, 11:15 PM
Quote:
Originally Posted by Mekhu
Unfortunately, I'm not an *******. Steadfast has given me and my clients years of uninterrupted service without issues. I still trust them. I just want this over and things back to their normal, reliable ways.
I have to agree with this. I was going to e-mail them regarding some sort of credit, but I said screw it. The company has helped me out in rather big ways in the past and didn't rape my wallet, and that's important. Plus, the service has always been fast and thorough, so I'm not about to let one screw-up (albeit a big one) cause me to move.

Accidents happen, I guess.

Posted by David, 09-12-2010, 11:20 PM
Quote:
Originally Posted by Mekhu
Unfortunately, I'm not an *******. Steadfast has given me and my clients years of uninterrupted service without issues. I still trust them. I just want this over and things back to their normal, reliable ways.
Why, is the question? Why still trust them?
I'm afraid that, at least in my case, I've lost every last ounce of faith I once allocated to Steadfast & Karl. Something to consider is how companies respond during the worst of times -- not how great and fantastic things are on the average day.

This wasn't your average day, and it has left you without 99.9% uptime for the next year and a half. Though I've managed to turn it into a situation where my clients have the utmost faith in my own service, Steadfast did the exact opposite for me today.

Their responses and communication via every method they utilized (the few they did) weren't thorough, and the odd update we received beyond Karl's "letter from the CEO" was absolutely !@#%$ing useless. The lack of thoroughness in the announcements / news from Steadfast was absolutely appalling; with hours between updates, you would think more than a single line could be compiled, especially after ~nineteen hours of downtime and maintenance.

Quote:
7:20 PM: Core4 is actively being worked on and appears to be the last issue. Customers homed to Core4 only are down, but all other services should be normalized. If your account is still down, please contact our support department.
6:10 PM: There are off and on memory issues with one of the core routers still causing some issues that we are working with the vendor to resolve, but everything else should be online. If not, please contact our support department.
5:10 PM: Key components are now working, services should be coming up for most customers shortly.
2:29 PM: Restoration is taking longer than anticipated. ETA revised to 4:30PM CDT
2:29 PM: ETA is about 30m-1h. We'll fully update when things are back online, but it's expected we'll see a full restoration of connectivity before 3:30PM CDT
Seriously? The times also changed on that 2:29-to-2:29 item as well. Every two hours I sent five-paragraph emails to my clients, explaining or expanding on what little data I was receiving from Steadfast. This isn't service we can easily pass off to our clientele, especially not if you're an avid fan of thorough communication. Steadfast has proven to me today that they not only don't deserve a damn cent of my clients' funds, but aren't interested in working for it.

Though I'm sure you've all heard enough of my disapproval, I'll make an exit now.
God bless.

Posted by kaniini, 09-12-2010, 11:24 PM
Quote:
Originally Posted by KarlZimmer
And you don't know the full situation. This is the only major network outage in 3+ years, by my last count. This was a MAJOR undertaking that we mis-evaluated. Yes, we made a mistake and were not fully prepared to make this migration, but there are MANY extenuating circumstances that have brought this issue to the point it is now at.
What about the times last year and the year before when your distribution network completely failed?

This is not the first serious outage you guys have had, but I do agree that the changes you are making are for the better.

Posted by jon222, 09-12-2010, 11:27 PM
We do not know the extenuating circumstances, because all your status page has told us is "give us 2 more hours" for the past 14 hours. Oh, and also that everything is fixed, except where it isn't.

Posted by panopticon, 09-12-2010, 11:45 PM
Despite the issue today, I plan to stay with steadfast for a long while. Their staff, service, hardware, and network have been excellent for me to date; everyone can make a mistake once.

Posted by kdill, 09-12-2010, 11:46 PM
Everyone has their days. I too lost money today, but I have lost MORE money jumping from host to host looking for a good one. I will continue my service with them for as long as I can afford it. They are number one in my book and will stay that way, 100 percent uptime or not. I have faith this will never happen again, as I have seen nothing but wonderful service from them up till today, and it's almost behind us.

Posted by Steven, 09-12-2010, 11:51 PM
Quote:
Originally Posted by KarlZimmer
And you don't know the full situation. This is the only major network outage in 3+ years, by my last count. This was a MAJOR undertaking that we mis-evaluated. Yes, we made a mistake and were not fully prepared to make this migration, but there are MANY extenuating circumstances that have brought this issue to the point it is now at.
http://www.webhostingtalk.com/showth...ight=steadfast

That was a pretty bad string of outages - roughly 2 years ago.

Posted by jon222, 09-12-2010, 11:58 PM
You guys act like this was emergency downtime. Nobody imposed this on them; it was supposed to be network maintenance. They promised no more than 30 minutes of downtime for any customer. Their new equipment messed up; I wouldn't complain about that, as we all know new equipment will give you problems. Why wasn't everything rolled back to the previous configuration while they figured out what went wrong? Should we suffer 24 hours of on-off downtime for what started as network maintenance? The lack of contingency for this is what saddens me.


All that said, they still offer great service every other day. I am really stressed out because I personally can't do anything about this type of downtime, and my business is suffering because of it.

Posted by NetDoc, 09-13-2010, 12:25 AM
You know, I read all 7 pages of hurt, angst and betrayal. I must say that I am happy that I chose Atlantic.net. Good luck, peeps.

Posted by KarlZimmer, 09-13-2010, 12:27 AM
Quote:
Originally Posted by jon222
You guys act like this was emergency downtime. Nobody imposed this on them; it was supposed to be network maintenance. They promised no more than 30 minutes of downtime for any customer. Their new equipment messed up; I wouldn't complain about that, as we all know new equipment will give you problems. Why wasn't everything rolled back to the previous configuration while they figured out what went wrong? Should we suffer 24 hours of on-off downtime for what started as network maintenance? The lack of contingency for this is what saddens me.


All that said, they still offer great service every other day. I am really stressed out because I personally can't do anything about this type of downtime, and my business is suffering because of it.
We DID roll it back. We rolled it back, and then the chassis we were using could no longer be powered on, one primary supervisor engine could not be powered on in any of three chassis, and our backup supervisor card was having odd memory issues. Nothing was going right; no matter what we did, there was another issue.

We'll have a full explanation out tomorrow along with resolutions, etc.

To note, the routing issues should be over; if you still see anything, please send us a ticket with a traceroute.
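If you're not sure how to capture one, running the system traceroute and pasting its output into the ticket is all we need. A quick sketch (assumes a Unix box with the traceroute binary installed; on Windows use tracert; the target IP is a placeholder for your own server):

Code:
# Capture a traceroute to attach to the support ticket.
import subprocess

TARGET = "203.0.113.10"  # placeholder: your server's IP

result = subprocess.run(
    ["traceroute", TARGET],
    capture_output=True, text=True, timeout=120,
)
print(result.stdout + result.stderr)  # paste this into the ticket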

Posted by jon222, 09-13-2010, 12:32 AM
I'm sorry, Karl, I did not know that; I'm only going on what the status page announced, which made it sound like you guys just persevered. My bad.

Posted by Steven, 09-13-2010, 12:55 AM
I see what the problem is..

Quote:
In the end, we selected the Brocade (formerly Foundry) MLXe-16, the newest member of the MLX family that isn't planned for official release until mid-September.
That there - I consider this beta hardware. What in the world were you thinking?
Why did you break what was working and proven for something that is unproven?

Posted by misterd, 09-13-2010, 01:39 AM
Quote:
Originally Posted by Steven
....
Why did you break what was working and proven for something that is unproven?
I remember that while I was touring DCs several years ago, some of them had "unreleased" core routers as well. Now, I don't know if it's standard practice for manufacturers, if providers have good connections with them, or if they just need some lab rats before official release.

I'm stressed out about this situation and have lost quite a bit of money & reputation as well. While 24-hour maintenance is not acceptable, this kind of thing happens, and I believe they handled it the best they could.

P.S. - My VLAN has been stable for almost 6 hours (fingers crossed). I had problems reaching some networks a few hours earlier, but things now appear to be normal.

Posted by joshribakoff, 09-13-2010, 01:43 AM
I understand issues happen, but there are multiple factors at play

1) They sent an email reminding us about the scheduled maintenance, yet did not send one notifying us when it went awry. This provided false intelligence that delayed our decision to jump ship. We regret this.

2) They did not attempt to roll back / did not have a solid enough rollback plan in place. OK, technology messes up, but when you have a data center to run, you shouldn't go playing around with beta hardware without a solid rollback procedure.

3) Just an all-around lack of response. Phone circuits jammed. No emails. False timelines, false intelligence given out, a crappy SLA policy.

I've spent $2,000+ and all I'm going to get is a $30 credit, if that. I've already set up a new account at The Planet; the cloud hosting is running about 10x faster on page loads than Steadfast was.

I always liked working with Steadfast because they're a small company; Karl answered the phone personally. But this is unforgivable. While my site was down and I was spending hours frantically pounding on my keyboard, they prolonged the outages for 17hrs (regardless of what they will say). The line on his site that says we "won't need an upgrade for another decade" is the kicker. This is a huge bluff, an outright lie. You're just saying that to kiss up to us; you can't make an assertion like that. When stuff goes wrong you shift the blame onto the hardware, but at the same time you assure us how good the hardware is. I can't continue to do business with a company with that kind of cognitive dissonance. It reminds me of the infamous Bush quote, "fool me once, fool me twice".

Posted by kdill, 09-13-2010, 01:58 AM
Hey, life happens. Guess what? It happened to Karl also. He knows the severity of the situation. I mean, what do you tell your clients!? He knows there are going to be people like you that just rage out and leave, but he has to say something. When I read that comment, I had the common sense to see that he only meant to keep people relaxed about the future, not to literally say they are never going to touch their **** for 10 years.

Posted by joshribakoff, 09-13-2010, 02:12 AM
Quote:
Originally Posted by kdill
I had the common sense to see that he only meant to keep people relaxed about the future, not to literally say they are never going to touch their **** for 10 years.
Shouldn't we be able to take his statements literally? I hired him to host my website, not to soothe me. So when he said service was restored at 5am and it was not, he was not being literal. When he says service is now restored, he's wrong again, is what you're implying. I'd rather be given a literal statement than a false, sugar-coated one. Couple that with his claims here about never having 'major' [1] downtime, which were disproved by other long-time customers (and to which I can personally attest). The statements betray his true attitude toward the company, which is to keep BSing us. I was plenty relaxed, all day in fact. Until I read that statement; then my blood boiled. It's like 'drill baby drill': we are being assured the inevitable will not happen, when everyone knows that it will. My new host isn't making me false promises.

[1] - (of course he injects the subjective term for plausible deniability)

Posted by ManagerJosh, 09-13-2010, 02:29 AM
Quote:
Originally Posted by Steven
I see what the problem is..



That there - I consider this beta hardware. What in the world were you thinking?
Why did you break what was working and proven for something that is unproven?
Note that I am not using this as an excuse or justification for our situation, but the point is that something as routine and simple as this may not always work out as intended.

One incident that comes to mind is the recent update McAfee released for their antivirus solution. By that day's end, system administrators had hundreds of computers with corrupted Windows installations because of a bad patch.

Something as mundane as a patch release, done on a weekly if not daily basis by McAfee, had issues and caused problems.

Posted by jon222, 09-13-2010, 02:29 AM
I wouldn't go that far. Their uptime has been excellent bar that previous incident, and it can be easy to forget how long ago that was; I thought it was longer ago too.

kdill asks: what do you tell your clients?

- Detailed reports of what is going on at the time, not one-liners
- Accurate timelines. Don't claim service is restored if it isn't, because that just creates more problems
- Refund this month proactively so people don't have to beg. This gesture of goodwill would probably save some customers, because I know most people will just write it off and leave. I'm not looking forward to tomorrow, when I'll message them to have them determine whether or not this even constitutes a credit (I'm not hopeful, given the response Karl gave earlier)

Posted by joshribakoff, 09-13-2010, 02:35 AM
Quote:
Originally Posted by ManagerJosh
the point is that something as routine and simple as this may not always work out as intended.
Thanks, Captain Obvious ;-) The problem with that is your website currently implies it is not possible: "You have nothing to fear of this being a continuous issue or a recurring event." So either you're not going to touch any settings or hardware for the next 10 years, or you are being very misleading to us.

11 PM and they're still having "isolated" incidents. Coincidentally, I have just finished moving all of my important sites to The Planet's cloud hosting.

Posted by pfak, 09-13-2010, 02:35 AM
Apparently Steadfast, while claiming to have their network back up, is not actually back up. A number of customers I know still do not have service, and support tickets are not being responded to.

24 hours and counting.

Posted by ManagerJosh, 09-13-2010, 02:38 AM
@joshribakoff - All I can say is I'm sorry. I'm sorry for all the problems we caused you and that you feel your trust has been misplaced.

I do hope you will continue placing your trust in our services and that we will be able to demonstrate why we deserve your trust.

Posted by ManagerJosh, 09-13-2010, 02:41 AM
Quote:
Originally Posted by pfak
Apparently Steadfast, while claiming to have their network back up, is not actually back up. A number of customers I know still do not have service, and support tickets are not being responded to.

24 hours and counting.
Hi pfak:

If you, or any Steadfast customer, are still experiencing issues with your service, please do not hesitate to open a support ticket.

Please include a traceroute in the ticket and we'll work on getting it resolved immediately.
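
If it helps, here's a rough sketch of capturing one to paste into the ticket, assuming a Unix-like machine with the standard traceroute tool on your PATH (running traceroute straight from a shell is just as good; Windows users would use tracert):

Code:
# Rough sketch: capture a traceroute to paste into a support ticket.
# Assumes a Unix-like host with the standard traceroute binary on PATH.
import subprocess

def capture_traceroute(host, timeout=120):
    """Run traceroute against host and return its combined output as text."""
    result = subprocess.run(
        ["traceroute", host],  # use ["tracert", host] on Windows
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr

# steadfast.net stands in here for whatever host you cannot reach.
print(capture_traceroute("steadfast.net"))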

Posted by joshribakoff, 09-13-2010, 02:44 AM
Quote:
Originally Posted by ManagerJosh
@joshribakoff - All I can say is I'm sorry. I'm sorry for all the problems we caused you and that you feel your trust has been misplaced.
That's a lie. There's something else you could say, for example:
"We are officially retracting our statements that 'there is nothing to worry about'; in fact, things can go wrong in the future, although we will do our best. We want you to know that we have learned, and that on this issue we failed to communicate with our customers. In the future we will be more open and transparent, and we are also refunding everyone for the month's service."

You could say something along those lines... that would be a start.

Posted by KarlZimmer, 09-13-2010, 02:45 AM
Quote:
Originally Posted by joshribakoff
I understand issues happen, but there are multiple factors at play:

1) They sent an email reminding us about the scheduled maintenance, yet did not send one when it went awry. That false intelligence delayed our decision to jump ship, which we regret.

2) They did not attempt to roll back / did not have a solid enough rollback plan in place. OK, technology messes up, but when you have a data center to run you shouldn't go playing around with beta hardware without a solid rollback procedure.

3) Just an all-around lack of response. Phone circuits jammed. No emails. False timelines, false intelligence given out, crappy SLA policy.
1) We specifically posted it on the front of our web site and at the top of our support pages, pointing to the status page, which was regularly updated. We said so as soon as we felt things were going to run long. As far as false intelligence goes, what was false? The timelines were actually accurate; it was just that as one problem was solved, a brand new one arose.

2) We did roll back, and we had a plan to roll back, but when we started the rollback we had a chassis failure and two management module failures. We've had only one failure with a 6500 in the past 5 years we've been using them. How do you plan for that?

3) We kept the site as up-to-date as possible and, to the best of my knowledge, we responded to every one of the thousands of tickets. In addition, I dare you to find anyone with a better SLA...

Posted by joshribakoff, 09-13-2010, 02:49 AM
Quote:
Originally Posted by KarlZimmer
1) We specifically posted it on the front of our web site
After the 17-hour mark, did it cross your mind that it might warrant an exceptional mass mailing? That should have gone out 3 hours in, in my opinion.

Posted by KarlZimmer, 09-13-2010, 02:55 AM
Quote:
Originally Posted by pfak
Apparently Steadfast, while claiming to have their network back up, is not actually back up. A number of customers I know still do not have service, and support tickets are not being responded to.

24 hours and counting.
Network is up and all support tickets are being answered...

Posted by KarlZimmer, 09-13-2010, 02:59 AM
Quote:
Originally Posted by joshribakoff
After the 17-hour mark, did it cross your mind that it might warrant an exceptional mass mailing? That should have gone out 3 hours in, in my opinion.
3 hours in, we were still in the middle of the maintenance window. We set a 6-hour maintenance window for a reason; it was a large project.

Customers were aware of the issue; we answered thousands of tickets and hundreds of phone calls. Regular updates were posted on the site, and we made it known here and on our own forum. We will be sending out a full review of the events, future procedure changes, etc. tomorrow. That is a document that certainly could not have been assembled during the rush of the day.

Posted by joshribakoff, 09-13-2010, 03:46 AM
Quote:
Originally Posted by KarlZimmer
3 hours in, we were still in the middle of the maintenance window. We set a 6-hour maintenance window for a reason; it was a large project.
Totally missed my point. You should have sent it after 6 hours, then. When you fix one thing and another breaks, and that happens 10 or so times in a row, there's a trend that should be recognized. After 10 or so rounds of thinking "you've got it", you needed to step up and admit you didn't quite "have it". This should have taken place during the incident, not after it. We know you're honestly sorry; we just don't care. What matters is what took place, and a simple notification would have saved you my business. You emailed when the maintenance started, so why couldn't you have emailed when **** hit the fan as well?

Posted by KarlZimmer, 09-13-2010, 04:29 AM
Quote:
Originally Posted by joshribakoff
Totally missed my point. You should have sent it after 6 hours, then. When you fix one thing and another breaks, and that happens 10 or so times in a row, there's a trend that should be recognized. After 10 or so rounds of thinking "you've got it", you needed to step up and admit you didn't quite "have it". This should have taken place during the incident, not after it. We know you're honestly sorry; we just don't care. What matters is what took place, and a simple notification would have saved you my business. You emailed when the maintenance started, so why couldn't you have emailed when **** hit the fan as well?
If I had a way to predict hardware failures, I would certainly have used that talent here. Once you have one failure, a once-in-five-years experience for us with the 6500s, you're generally not expecting a second and then a third. You'd love to think you can plan for everything, but you can't. That is one thing we certainly learned here and will be planning around in the future, going in much smaller, more reasonable increments.

Posted by joshribakoff, 09-13-2010, 04:39 AM
Quote:
Originally Posted by KarlZimmer
Once you have one failure, ..., you're generally not expecting a second and then a third.
The whole timeline is a chronicle of "we've almost got it".

Quote:
You'd love to think you can plan for everything, but you can't.
It's really not an issue of planning. It's an issue of competence. The plan was not sufficient, I know. The issue I am pinning you on is not the plan, though; it was your failure to issue a mass mailing.

When the maintenance exceeds the given maintenance window, that is a critical decision point for your company. You had to decide between accepting humility, or trying to sweep your dirt under the rug and hoping no one noticed. You made a decision to play it low-key, it hurt your customers, and now you assert that you made all the right decisions.

No one asked you to notify us the second you exceeded the window. But what about an hour? Or two hours? Or three? If/when this happens again, how long are you going to wait? 17 hours again? I still fail to understand how the *whole* data center going down for 12 hours is not an important enough event for a mass mailing.

Posted by kaniini, 09-13-2010, 05:11 AM
Quote:
Originally Posted by kdill
Hey, life happens. Guess what? It happened to Karl also. He knows the severity of the situation. I mean, what do you tell your clients!? He knows there are going to be people like you who just rage out and leave, but he has to do something. When I read that comment, I had the common sense to see that he only meant to keep people relaxed about the future, not to literally say we are never going to touch our **** for 10 years.
"Life happens" is not an appropriate statement for this situation. This was a voluntary maintenance and they should have been properly prepared for it.

What did I tell my clients? The truth: that Steadfast promised this maintenance would not go down this way, and then I linked them to the status page. What else could I have done in this situation? Made up things that I didn't know were true or false?

Steadfast was way too terse during this situation, and on top of that, I have heard their "on-site engineers" were actually working remotely. This was mentioned earlier in the thread. Karl promised this would not be the case, that all engineers would be onsite. I do not know who is telling the truth here, but I do know that Steadfast has had engineers work remotely in the past.

I can understand why Karl wants people to be relaxed about the future, but since they have done a rollback, this means we're going to have another downtime in the near future. I cannot relax about that given the fact that this chaos has already happened. There are unresolved questions and there is not even an RFO available yet. Due to all of this, none of my customers have any confidence in the Chicago location anymore - in effect, we have gotten hundreds of transfer requests to our Los Angeles location at QuadraNet and more people asking whether or not our Chicago location will be stable again.

That is how bad this outage was: 20 hours of pure hell for my clients, which leaves them with a lot of doubt. I really hope that Steadfast gets this right, and gets the router replacement right, so that we can just work through this, but I have to prepare for the possibility that they will not. The good news is that I have a contingency plan for that; the bad news is that in the short term we're going to have to live through the chaos regardless.

I like Steadfast as a provider. Typically they have been pretty solid, but then these things happen (and they have happened before) and it really shakes things up...

Posted by spaethco, 09-13-2010, 09:41 AM
Quote:
Originally Posted by nenolod
I have heard their "on-site engineers" were actually working remotely.
Irrespective of the other issues in this thread, this is really an unimportant point. Clearly you need on-site folks to handle physical activities like plugging in cables, but every other aspect of network changes can be managed remotely. You've seen the chassis - there is nothing on the hardware itself that requires the skills of a network engineer to manipulate -- you slide cards into slots and plug cables into ports. With a terminal server for console access and in-band IP access you can manage all aspects of network configuration. Being on-site or off-site doesn't change your approach to configuration and troubleshooting -- it's not like you need to look at lights on the front of the switch for anything.

All the other points and concerns are valid, but calling out on-site vs remote resources is an issue in appearance only.

Posted by chrono-it, 09-13-2010, 10:04 AM
If anyone is still seeing issues please PM or email me directly at marc@steadfast.net with your ticket number and I will have it looked into right away.

Posted by dariusf, 09-13-2010, 11:12 AM
Talk about a stressful Sunday. It is very unfortunate you guys ran into all these issues and the down window stretched out so much. I have been colocating servers with you for a few years and until now I was extremely satisfied. Very stable, super fast response and support, very friendly, fast hardware access, reasonably priced.

That stated, I'm definitely disappointed on several issues.

1) Notification - having colocated servers with you for a few years, I was totally unaware that I had to request notifications. I can maybe understand no default notifications for some shared-hosting website customers, but for people colocating their servers? These should be sent out by default. I can't imagine someone colocating servers and NOT wanting to get service notifications.

2) Maintenance rollback policy - I feel there is absolutely no excuse for dragging out the decision to roll back. I understand that there are times when this much time might be needed, but 6 hours just for the rollout is quite excessive if rollback time is not included. The maintenance window should not be 6 hours without including the rollback time in it. If, for example, it takes 4 hours to roll out and 2 hours to roll back, then you should have your maintenance window set at 6 hours and the deadline set at 4 hours. Once you hit that 4-hour point, you should automatically execute the rollback, regardless of how close you feel you are to completing the rollout.

If everyone had been notified about the maintenance and you had initiated the rollback at a set point, not exceeding the 6-hour maintenance window, then there would be no issues at all. There is always another day you could attempt the upgrade. The additional rollback hardware issues would have extended that a bit, but it still would have been understandable and close to the 6 hours.

Things happened, and there is no need to dwell on them, only to learn from them. I hope this unfortunate event will result in upgrades to your procedures and that we will not see anything like it again.

This is not a first outage, as things happen. I recall the power failure and subsequent backup generator failure at Equinix a few years back.

I am looking to accelerate my implementation of backup servers at another provider; it was my fault for not having them in place already.

I have been a very satisfied customer so far and will remain your customer for time to come.

Darius
Cybermash

Posted by kaniini, 09-13-2010, 11:49 AM
I would just like to point out that we still are awaiting an actual RFO statement from Steadfast about this.

In the meantime, I would like to ask why there are so many customer cables going directly into the old core routers? See attached picture.

Is the real reason for switching to the Brocade gear to get rid of the distribution network entirely?

Posted by David, 09-13-2010, 12:01 PM
Quote:
Originally Posted by nenolod
Is the real reason for switching to the Brocade gear to get rid of the distribution network entirely?
No, it was to get rid of the network entirely. Worked for almost ~20 hours until Karl's evil plans were thwarted. Alas, until next time.

Posted by HiDef-Laws, 09-13-2010, 01:01 PM
Man, you guys are lethal in this thread. Steadfast has a rather long history of providing stellar uptime. I'm not going to kill someone for extended downtime over the course of one day due to the ridiculous amount of bad luck they had with a large infrastructure update. I'm far less upset by that than by not knowing I needed to sign up for notices to be sent to me. Something this large, with a window that large, should really have been sent to all customers... if they complain about receiving the email, that's too bad... but then they can't say you didn't notify them about it.

I think some people need to take a deep breath before continuing to bash Steadfast. Many people in here have gotten very childish in their rants. Take it private if you need to continue the bashing, I'm sure many don't really care to see you scream and stomp your feet.

Posted by dariusf, 09-13-2010, 01:04 PM
Quote:
Originally Posted by HiDef-Laws
I'm far less upset by that than by not knowing I needed to sign up for notices to be sent to me. Something this large, with a window that large, should really have been sent to all customers... if they complain about receiving the email, that's too bad... but then they can't say you didn't notify them about it.

I think some people need to take a deep breath before continuing to bash Steadfast. Many people in here have gotten very childish in their rants. Take it private if you need to continue the bashing, I'm sure many don't really care to see you scream and stomp your feet.
Totally agree on both. The notifications for this type of huge change should have been sent out to everyone at least a couple of times, well ahead of the change, to make sure all are aware and can plan for it.

Posted by The Universes, 09-13-2010, 01:14 PM
Quote:
Originally Posted by HiDef-Laws
Man, you guys are lethal in this thread. Steadfast has a rather long history of providing stellar uptime. I'm not going to kill someone for extended downtime over the course of one day due to the ridiculous amounts of bad luck they had with a large infrastructure update.
I personally don't believe it boils down to just that. My main concern is that SF has consistently downplayed the impact of this "maintenance" and downplayed the resulting issues that occurred. Support provided no information about what was going on; the status page had 2 lines of gibberish and a non-informational letter from Karl. I would really like to see a provider be more open about their mistakes and more forthcoming about what is actually going on and what is being done to address the issues.

Posted by kdill, 09-13-2010, 01:20 PM
Read the first post in this topic; I never once felt it was downplayed. I'm sure Karl was sick to his stomach all yesterday worrying about getting your service back. You guys act like they did this on purpose and that unexpected things never happen. I want to visit this perfect world.

Posted by kaniini, 09-13-2010, 01:21 PM
Quote:
Originally Posted by kdill
Read the first post in this topic. I never once felt it was downplayed. I'm sure Karl was sick to his stomach all yesterday. You guys act like they did this on purpose and that unexpected things never happen. I want to visit this perfect world.
The August 24th email was downplayed, coloured with phrases like "minimal downtime expected".

20 hours of downtime is not minimal.

I am sure Karl was sick to his stomach all day too. Thanks to this, I am pretty sure I now have an ulcer.

Posted by KarlZimmer, 09-13-2010, 01:21 PM
Quote:
Originally Posted by nenolod
I would just like to point out that we still are awaiting an actual RFO statement from Steadfast about this.

In the meantime, I would like to ask why there are so many customer cables going directly into the old core routers? See attached picture.

Is the real reason for switching to the Brocade gear to get rid of the distribution network entirely?
There are no customers connected to the core switches directly; everything goes through an aggregation/access layer off of the core routers. The reason for the Brocades was, yes, to replace the Cisco 6500s, which were being used in a combined core/distribution configuration for the gear at 350 E Cermak; we have a separate distribution layer set up at 725 S Wells.

The plan now, and as things are set up, is that the Brocades will simply take over BGP, OSPF, etc., and the Ciscos will act purely as distribution switches, nothing more, holding customer VLANs, handling VRRP, etc. This separation is significantly more expensive, but it should make our network more robust and make an upgrade such as the one we attempted a thing of the past, as simply moving BGP sessions to an additional router is MUCH, MUCH simpler and a task we've completed successfully on many occasions.

Posted by KarlZimmer, 09-13-2010, 01:22 PM
Quote:
Originally Posted by nenolod
The August 24th email was downplayed, coloured with phrases like "minimal downtime expected".

20 hours of downtime is not minimal.
Yes, and simply put, things did not go as expected.

Posted by kdill, 09-13-2010, 01:24 PM
You didn't read what I said. It would have been minimal if things had not gone awry. Some stuff you just can't control; a series of unlucky, totally unexpected failures is going to increase downtime and make that statement look like downplaying.

Posted by KarlZimmer, 09-13-2010, 01:25 PM
Quote:
Originally Posted by The Universes
I personally don't believe it boils down to just that. My main concern is that SF has consistently downplayed the impact of this "maintenance" and downplayed the resulting issues that occurred. Support provided no information about what was going on; the status page had 2 lines of gibberish and a non-informational letter from Karl. I would really like to see a provider be more open about their mistakes and more forthcoming about what is actually going on and what is being done to address the issues.
Personally, I thought we addressed that. The issues were noted on the announcement page as they were discovered and as things progressed. In addition, I feel I was very open, saying it was our mistake, our mismanagement of the situation, etc., and I am currently working on a letter that will contain some more details. As I'm still crafting this letter, can you tell me how you felt we downplayed the issue? What specifically do you feel we weren't open about, and what specific information do you want us to disclose?

Posted by kaniini, 09-13-2010, 01:38 PM
Quote:
Originally Posted by kdill
You didn't read what I said. It would have been minimal if things had not gone awry. Some stuff you just can't control; a series of unlucky, totally unexpected failures is going to increase downtime and make that statement look like downplaying.
Yes, it is truly unfortunate that he had these problems. I am not saying that it isn't unfortunate, but his staff should have begun a rollback inside the maintenance window when it was obvious things weren't going to plan, not at 3 in the afternoon, many hours past the point, and when they did begin the rollback they should have made it very clear they were doing a rollback.

I have to stress this is not the first time this has happened. When I first moved to Steadfast from Equinix, the move was followed by a string of outages, previously mentioned in this thread. Since then, it has been pretty good though, and I do give him points for that.

However, my customers are out for blood because of this outage, and rightly so. So we are under a lot of pressure to ensure that Karl is going to get this right, or to relocate their servers to a DC that is not Steadfast. Ultimately what they are looking for is a permanent fix to this problem, so many of us are having to ask Karl questions about the outage and how it was handled.

Karl is a very nice guy. I like working with him. I would like to continue working with him. But I need to know what went wrong, why it went wrong, and how it will be corrected in the next attempt to alter the network topology. If I can't tell my clients this information, then they will be moving their virtual machines to other regions or demanding that we move our POP to a different facility.

Right now I have a good number of customers doing both of these things, which leaves me at a crossroads as far as options go. The rest, I may have to offer upgrades or a service credit to get them to stay. Who knows... my customers seem to like action a lot more than service credits. The options are ultimately: trust that Karl will have fixed the problem by his next attempt and that there will be no more catastrophes in the near future (say a 6-month window), or start formulating an exit strategy.

Regardless, we are taking action now to move off reassigned IPs, which reduces dependency on our datacenter providers (including Steadfast). We owe that to our clients. Hopefully, we won't have to take advantage of that increased portability anytime soon, but it will at least be reassuring to our customers once we get it, because it gives us the power to leave without forcing them to renumber.

Posted by kaniini, 09-13-2010, 01:49 PM
Quote:
Originally Posted by KarlZimmer
Personally, I thought we addressed that. The issues were noted on the announcement page as they were discovered and as things progressed. In addition, I feel I was very open, saying it was our mistake, our mismanagement of the situation, etc., and I am currently working on a letter that will contain some more details. As I'm still crafting this letter, can you tell me how you felt we downplayed the issue? What specifically do you feel we weren't open about, and what specific information do you want us to disclose?
I want to know specifically what happened and why you felt you should continue pushing forward instead of rolling back at the first sign of trouble.

I want to know what has changed that will make the next attempt work.

I want clarity on whether or not there will be people from Brocade on-site, not working remotely, actually on-site.

I want to know that you will abort your next attempt if there is any sign of a serious issue. I *need* to know this.

I want clarity on whether or not what Mike said was accurate. (I surely hope you did not fire him over the IRC logs in this thread... if you did, you need to fix that by unfiring him. I cannot support that kind of business strategy and feel good about it when I go to sleep at night.)

Ultimately what I want is something that I can bring to my customers to assure them that everything will be fine with their servers during the next upgrade. If you can't provide such an assurance, then you need to re-evaluate your plan.

Posted by KarlZimmer, 09-13-2010, 01:59 PM
Quote:
Originally Posted by nenolod
Yes, it is truly unfortunate that he had these problems. I am not saying that it isn't unfortunate, but his staff should have begun a rollback inside the maintenance window when it was obvious things weren't going to plan, not at 3 in the afternoon, many hours past the point, and when they did begin the rollback they should have made it very clear they were doing a rollback.

I have to stress this is not the first time this has happened. When I first moved to Steadfast from Equinix, the move was followed by a string of outages, previously mentioned in this thread. Since then, it has been pretty good though, and I do give him points for that.

However, my customers are out for blood because of this outage, and rightly so. So we are under a lot of pressure to ensure that Karl is going to get this right, or to relocate their servers to a DC that is not Steadfast. Ultimately what they are looking for is a permanent fix to this problem, so many of us are having to ask Karl questions about the outage and how it was handled.

Karl is a very nice guy. I like working with him. I would like to continue working with him. But I need to know what went wrong, why it went wrong, and how it will be corrected in the next attempt to alter the network topology. If I can't tell my clients this information, then they will be moving their virtual machines to other regions or demanding that we move our POP to a different facility.

Right now I have a good number of customers doing both of these things, which leaves me at a crossroads as far as options go. The rest, I may have to offer upgrades or a service credit to get them to stay. Who knows... my customers seem to like action a lot more than service credits. The options are ultimately: trust that Karl will have fixed the problem by his next attempt and that there will be no more catastrophes in the near future (say a 6-month window), or start formulating an exit strategy.

Regardless, we are taking action now to move off reassigned IPs, which reduces dependency on our datacenter providers (including Steadfast). We owe that to our clients. Hopefully, we won't have to take advantage of that increased portability anytime soon, but it will at least be reassuring to our customers once we get it, because it gives us the power to leave without forcing them to renumber.
To note, with the maintenance, things WERE working. Everything was up and running on the Brocade gear without issue at around 6:15AM, and we figured we could get any necessary adjustments made for the few customers still seeing issues before the extended period ended at 7AM. The window was scheduled to be 6 hours because we knew that, with the amount of work needed, it would take a good six hours to get done, though the effects on customers were not supposed to be that great. That everything had been working smoothly made us sure it must have been one of the final changes we made that caused the issues, so we reverted configs, put configs back in place, and worked with 3rd party engineers and Brocade. We got close several times, CPU load going back down, etc., just for it to flare up again. When it became evident there was no way to get it resolved in a reasonable period of time, we went with plan B.

As I stated before, I am working on a complete letter to describe the events and the actions we are now taking.

Posted by KarlZimmer, 09-13-2010, 02:10 PM
Quote:
Originally Posted by nenolod
I want to know specifically what happened and why you felt you should continue pushing forward instead of rolling back at the first sign of trouble.

I want to know what has changed that will make the next attempt work.

I want clarity on whether or not there will be people from Brocade on-site, not working remotely, actually on-site.

I want to know that you will abort your next attempt if there is any sign of a serious issue. I *need* to know this.

I want clarity on whether or not what Mike said was accurate. (I surely hope you did not fire him over the IRC logs in this thread... if you did, you need to fix that by unfiring him. I cannot support that kind of business strategy and feel good about it when I go to sleep at night.)

Ultimately what I want is something that I can bring to my customers to assure them that everything will be fine with their servers during the next upgrade. If you can't provide such an assurance, then you need to re-evaluate your plan.
1) We pushed forward because things were working; things had been operating. We knew a rollback would mean at LEAST 2 hours of downtime, and we were almost certain the repairs to the Brocades could be done in less time than that. That turned out to be wrong.

2) We are going with a completely different configuration, which will be detailed in the letter.

3) We had our own network engineering team on-site, a 3rd party engineer very familiar with Brocade gear, a 24/7 support contract with Brocade, and Brocade on the phone from the beginning. We thought we had taken the actions necessary to be prepared. Simply put, we will not carry out another maintenance of this scope again, ever, and the explanation will be outlined in the official letter.

4) Yes, all future maintenance will involve leaving the existing configuration in place and fully configured. There will be no more complete gear swaps, so rolling back will be trivial, and thus not an impediment to doing a quick rollback.

5) What Mike said was not accurate. He had just logged on to the staff chat and saw I was asking for transportation to 350, as I was working out of our other office. Our head network engineer, CTO, 3rd party engineer, and various other staff had been on-site since 2 hours (or more, for some) before the maintenance window. Mike has not been fired; it was a misunderstanding on his part in a hectic and stressful time.

6) That is entirely the focus of the new and revised plan.

Posted by joshribakoff, 09-13-2010, 02:11 PM
Quote:
Originally Posted by KarlZimmer
To note, with the maintenance, things WERE working.
This right here is why I am firing you. My definition of "working" apparently differs from yours. Do I need to post my downtime reports? I'll give you a hint: at no point since the maintenance began did things remain "up" for a continuous 10-minute period until 9am, only to go back to up/down all day starting at 10am.

And "not a big deal" with the lie about engineers being on site? How is a lie not a big deal? He's willing to lie about when the downtime occurred, lie about decisions that were made, lie about where the engineers were. The issue is not where the engineers were; I'm sure if he had told us they were working remotely, we would have no problem. The issue is that what he told us was not true. We are not mad at the situation so much as at Karl's handling of the situation.

Karl should admit he was in the wrong for not sending out a notification email. He'll get on here and post all day about how "this & that" happened with the hardware, which was out of his control, yet will not address the issues that are in his control (the lying, the negligence, the misleading timelines).

Quote:
Originally Posted by KarlZimmer
That everything had been working smoothly made us sure it must have been one of the final changes we made that caused the issues, so we reverted configs, put configs back in place, and worked with 3rd party engineers and Brocade. We got close several times.
When you reverted the configs and it didn't work, and you found yourself on the phone with tech support, don't you think you should have recognized the situation was out of your control?

You wrote: "When it became evident there was no way to get it resolved in a reasonable period of time"

When exactly did you realize? Was it 7am when you told us:
"7:38 AM: We are continuing to work on outstanding issues caused by the router replacements and currently have no ETA on resolution."

Posted by KarlZimmer, 09-13-2010, 02:27 PM
Quote:
Originally Posted by joshribakoff
This right here is why I am firing you. My definition of "working" apparently differs from yours. Do I need to post my downtime reports? I'll give you a hint: at no point since the maintenance began did things remain "up" for a continuous 10-minute period until 9am, only to go back to up/down all day starting at 10am.

And "not a big deal" with the lie about engineers being on site? How is a lie not a big deal? He's willing to lie about when the downtime occurred, lie about decisions that were made, lie about where the engineers were. The issue is not where the engineers were; I'm sure if he had told us they were working remotely, we would have no problem. The issue is that what he told us was not true. We are not mad at the situation so much as at Karl's handling of the situation.

Karl should admit he was in the wrong for not sending out a notification email. He'll get on here and post all day about how "this & that" happened with the hardware, which was out of his control, yet will not address the issues that are in his control (the lying, the negligence, the misleading timelines).
We have NEVER lied. I have tried to be open about this and have been truthful about the issues throughout the entire situation. The timelines provided were honestly the best known information at the time, but then things changed and additional issues surfaced, with those issues being outlined on the site.

We had a full 20 minutes of complete network stability. Sure, there were likely customer-specific issues; that was expected and was to be taken care of on a case-by-case basis, but the vast majority of customers, according to our internal and external reporting, were stable.

Honestly, I had no simple way to send an email. Our customer database systems are on our standard network, which was fully affected by these issues as well. Our support site, email, phones, office network, and own web site are on a separate AS, specifically to assure that they are reachable and can then be used for updates. Having the customer database fully accessible had not crossed our minds and was an oversight on our part, so we used the communication channels we did have available to us. This incident has of course led us to reconsider that and to put more items on our separate AS.

Posted by KarlZimmer, 09-13-2010, 02:32 PM
Quote:
Originally Posted by joshribakoff
When you reverted the configs and it didn't work, and you found yourself on the phone with tech support, don't you think you should have recognized the situation was out of your control?

You wrote: "When it became evident there was no way to get it resolved in a reasonable period of time"

When exactly did you realize? Was it 7am when you told us:
"7:38 AM: We are continuing to work on outstanding issues caused by the router replacements and currently have no ETA on resolution."
We had just had a working box and a working config; you have to weigh whether the 2+ hour fix of doing a complete rollback will cause more downtime than working out a usable config. Working with the Brocades, we had three separate instances where things were stable or getting to the point of being stable, just to have it all crash down again. It was after the third one, which was around noon, that we decided to roll back. We were stuck where the options were a known bad, an additional 2+ hours of downtime, or an unknown...

Posted by joshribakoff, 09-13-2010, 02:33 PM
Karl, I can't believe some of the stuff you are writing. Full 20 minutes of stability? Seriously wtf man.

Posted by kaniini, 09-13-2010, 02:46 PM
Quote:
Originally Posted by KarlZimmer
1) We pushed forward because things were working; things had been operating. We knew a rollback would mean at LEAST 2 hours of downtime, and we were almost certain the repairs to the Brocades could be done in less time than that. That turned out to be wrong.

2) We are going with a completely different configuration, which will be detailed in the letter.

3) We had our own network engineering team on-site, a 3rd party engineer very familiar with Brocade gear, a 24/7 support contract with Brocade, and Brocade on the phone from the beginning. We thought we had taken the actions necessary to be prepared. Simply put, we will not carry out another maintenance of this scope again, ever, and the explanation will be outlined in the official letter.

4) Yes, all future maintenance will involve leaving the existing configuration in place and fully configured. There will be no more complete gear swaps, so rolling back will be trivial, and thus not an impediment to doing a quick rollback.

5) What Mike said was not accurate. He had just logged on to the staff chat and saw I was asking for transportation to 350, as I was working out of our other office. Our head network engineer, CTO, 3rd party engineer, and various other staff had been on-site since 2 hours (or more, for some) before the maintenance window. Mike has not been fired; it was a misunderstanding on his part in a hectic and stressful time.

6) That is entirely the focus of the new and revised plan.
Yeah, things *were* working. What happened with that, by the way? As soon as you took out the old Cisco, the network came back to life and everything was happy... I mean, things were so happy your new Brocade equipment was pushing rainbows and unicorns through its unused ports and slots.

It was good times, man. Then things broke again. I was about to go to sleep when that happened. Ultimately, I didn't get to sleep until 9 hours later, when I finally just said "screw it, there's nothing more I can do to salvage this situation, I need to sleep before I punch a hole in my grandparents' nice new wall" (did I mention I am on holiday? what a great start to a holiday...)

It's like someone who is a tease. Your network... it was working... then it went dark. I was so happy it was over. It was working perfectly. Things were looking *great* like rainbows and flowers and bunnies and stuff. And then it didn't. It went dark. For ultimately an additional 14 hours.

Anyway.

When you implement this revised plan, I am quite happy to do you a favor. I have a Nagios setup monitoring my lines from Steadfast. If you would like, I can subscribe you guys to the outage alerts for that VLAN.

That way you will know instantly if the problem is bigger than perceived. Deal?
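
For the curious, there is nothing exotic behind those alerts; conceptually, the probe is just a periodic TCP connect, something like the sketch below (the address is a documentation placeholder, not my real probe target):

Code:
# Conceptual sketch of the probe behind an outage alert: try a TCP connect
# to a host on the monitored VLAN and report up/down on each cycle.
import socket
import time

HOST = "203.0.113.10"  # placeholder address (TEST-NET-3), not a real Steadfast IP
PORT = 80
INTERVAL = 60          # seconds between checks

def is_up(host, port, timeout=5.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

while True:
    state = "UP" if is_up(HOST, PORT) else "DOWN - this is where the alert fires"
    print(time.strftime("%Y-%m-%d %H:%M:%S"), HOST, state)
    time.sleep(INTERVAL)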

Posted by KarlZimmer, 09-13-2010, 03:07 PM
Quote:
Originally Posted by joshribakoff
Karl, I can't believe some of the stuff you are writing. Full 20 minutes of stability? Seriously wtf man.
In the next post, William seems to confirm our analysis. Again, it is certainly possible that you had a customer-specific issue that we were working on getting resolved in that remaining window, but we did significant testing and, honestly, the performance and the way things were functioning was amazing. Then it wasn't... This was absolutely the most frustrating day I have ever had, with how we would be teased by things working, just for them to fall apart.

Posted by joshribakoff, 09-13-2010, 03:16 PM
I truly feel bad for you, but I don't do business based upon emotion. That's not how I got to where I am now, and that's not how I'll succeed.

Please see the attached downtime report. At the times it said "ok", my website was still not up (taking too long to load, not loading at all, timing out, etc.). Essentially it was a continuous 17-hour block of downtime on our end. The only communication we received during this 17-hour period was the initial one telling us to ignore any downtime. I went to sleep with this downtime ongoing, thinking you had it under control - which you did not.

A single lost sale for me is $1,000+ (in profits lost, not just revenue...)

Posted by kaniini, 09-13-2010, 03:40 PM
Quote:
Originally Posted by joshribakoff
I truly feel bad for you, but I don't do business based upon emotion. That's not how I got to where I am now, and that's not how I'll succeed.

Please see the attached downtime report. At the times it said "ok", my website was still not up (taking too long to load, not loading at all, timing out, etc.). Essentially it was a continuous 17-hour block of downtime on our end. The only communication we received during this 17-hour period was the initial one telling us to ignore any downtime. I went to sleep with this downtime ongoing, thinking you had it under control - which you did not.

A single lost sale for me is $1,000+ (in profits lost, not just revenue...)
I am sorry to be rude, but frankly I find it hard to believe that you would host a $1,000+/mo enterprise on a $30/mo service plan.

The reason I know it was $30/mo is that everyone was effectively down for 20 hours, meaning they get a 100% SLA credit.

If your business was that big of a deal, you would have service from multiple providers, so that a failure at one company would not take your site down and you could continue doing your e-commerce sales...

Posted by joshribakoff, 09-13-2010, 03:45 PM
Yes, you are being rude. I was on a $100+ dedicated server until it got compromised. You don't know me, how many hosting accounts I have, or how much money I really make, so please stay out of my business. For all you know I could make $1 or $1M.

Posted by DPG, 09-13-2010, 03:59 PM
Quote:
Originally Posted by KarlZimmer
To note, with the maintenance, things WERE working. Everything was up and running on the Brocade gear without issue at around 6:15AM, and we figured we could get any necessary adjustments made for the few customers still seeing issues before the extended period ended at 7AM. The window was scheduled to be 6 hours because we knew that, with the amount of work needed, it would take a good six hours to get done, though the effects on customers were not supposed to be that great. That everything had been working smoothly made us sure it must have been one of the final changes we made that caused the issues, so we reverted configs, put configs back in place, and worked with 3rd party engineers and Brocade. We got close several times, CPU load going back down, etc., just for it to flare up again. When it became evident there was no way to get it resolved in a reasonable period of time, we went with plan B.

As I stated before, I am working on a complete letter to describe the events and the actions we are now taking.
I know that **** happens but this part is alarming. If the actual maintenance was going to take 6 hours and the maintenance window was only 6 hours, there was zero room for error.

Posted by KarlZimmer, 09-13-2010, 04:16 PM
Quote:
Originally Posted by DPG
I know that **** happens but this part is alarming. If the actual maintenance was going to take 6 hours and the maintenance window was only 6 hours, there was zero room for error.
The 6 hours to get it done allowed time for fix-ups, etc., which was of course calculated in. It was basically allocated as 60 minutes for base/config prep (plus two hours of non-service-affecting work before the maintenance), 90 minutes for the core3 replacement, 90 minutes for the core4 replacement, and 2 hours for fixing up the odds and ends; 60 + 90 + 90 + 120 minutes accounts for the full 6-hour window.

Based on similar maintenance we've done before, such as in New York in July, those were all over-estimates: that maintenance, replacing one switch with a new platform, was completed in less than 2 hours. We've swapped out switch platforms on many occasions previously, Foundry to Cisco, Cisco to Juniper, Juniper to Cisco, and thought we had a good handle on the time that would be needed.

Posted by Scott.Mc, 09-13-2010, 04:18 PM
Quote:
Originally Posted by joshribakoff
Yes, you are being rude. I was on a $100+ dedicated server until it got compromised. You don't know me, how many hosting accounts I have, or how much money I really make, so please stay out of my business. For all you know I could make $1 or $1M.
His point is perfectly valid, however. Regardless of how much you make (frankly, who cares; everyone on WHT always loses millions every second in outages on their $2/month account), if your service is important you should have contingency plans and redundancy.

Now, if you want to grumble about being down, then that's fine, but please be quiet with the statements of "I lost $xxxxxxxxxx". That's your problem, not theirs.

Posted by j4cbo, 09-13-2010, 05:20 PM
I was unable to connect to either of Steadfast's phone support lines throughout most of this incident. Are you planning on installing a real, non-VoIP phone line that won't stop working next time your entire network falls over?

Posted by spaethco, 09-13-2010, 05:50 PM
Quote:
Originally Posted by j4cbo
Are you planning on installing a real, non-VoIP phone line that won't stop working next time your entire network falls over?
Why would this matter?

You're probably not going to get more updates than were already provided electronically, and they don't exactly need you to call and report your server being down if it's a DC-wide network event.

Posted by KarlZimmer, 09-13-2010, 05:53 PM
Quote:
Originally Posted by j4cbo
I was unable to connect to either of Steadfast's phone support lines throughout most of this incident. Are you planning on installing a real, non-VoIP phone line that won't stop working next time your entire network falls over?
The phones are on a separate AS and network, and we were receiving calls through almost the entire incident. It was likely some sort of routing issue you were facing, though calls are hard to do a traceroute on. To address this, we are going to make gracefully turning down the ports to our own network part of our new maintenance checklist, so that routes are withdrawn cleanly and any potential routing issues/anomalies are avoided.

To note, one single phone line wouldn't have helped much either.

Posted by joshribakoff, 09-13-2010, 05:55 PM
Quote:
Originally Posted by Scott.Mc
If your service is important you should have contingency plans and redundancy.
We'd still have experienced downtime. 48 hours for DNS to propagate.

Posted by Steven, 09-13-2010, 05:58 PM
Quote:
Originally Posted by joshribakoff
We'd still have experienced downtime. 48 hours for DNS to propagate.
48 hours? Hardly.

Posted by spaethco, 09-13-2010, 05:58 PM
Quote:
Originally Posted by joshribakoff
We'd still have experienced downtime. 48 hours for DNS to propagate.
Only for registrar changes. If you were clueful with distributing your DNS servers you could have downtime on the order of a couple minutes, tops.

Posted by dariusf, 09-13-2010, 06:05 PM
Quote:
Originally Posted by spaethco
Why would this matter?

You're probably not going to get more updates than were already provided electronically, and they don't exactly need you to call and report your server being down if it's a DC-wide network event.
One thing that might be useful is redirecting calls to an automated recording with the details of the outage, but then again, I would prefer the staff be busy resolving the issue rather than updating every status channel.

I think it boils down to breaking the updates into smaller, more manageable chunks and including the rollback time in the maintenance window.

Posted by dariusf, 09-13-2010, 06:09 PM
Quote:
Originally Posted by KarlZimmer
The phones are on a separate AS and network, and we were receiving calls through almost the entire incident. It was likely some sort of routing issue you were facing, though calls are hard to do a traceroute on. To address this, we are going to make gracefully turning down the ports to our own network part of our new maintenance checklist, so that routes are withdrawn cleanly and any potential routing issues/anomalies are avoided.

To note, one single phone line wouldn't have helped much either.
I did get through around 10am CST or so and spoke to someone (I forgot the name) and was notified of the maintenance and issues. I was getting dropped calls, as in no ringing at all, from about 2pm to 3pm CST, at which time I googled this thread...

Posted by panopticon, 09-13-2010, 06:11 PM
This doesn't add up to me:
Quote:
Originally Posted by joshribakoff
Yes, you are being rude. I was on a $100+ dedicated server until it got compromised. You don't know me, how many hosting accounts I have, or how much money I really make, so please stay out of my business. For all you know I could make $1 or $1M.
Quote:
Originally Posted by joshribakoff
3) Just an all-around lack of response. Phone circuits jammed. No emails. False timelines, false intelligence given out, crappy SLA policy.

I've spent $2,000+ and all I'm going to get is a $30 credit, if that. I've already set up a new account at The Planet; their cloud hosting is running about 10x faster on page loads than Steadfast was.
Could you at least tell us what box(es) you had at Steadfast at the time of the outage?

The SLA won't cover your full losses from such an event if you're a for-profit, but at least it helps cover your time to respond or provides funds for an emergency setup if needed. I find Steadfast Networks' SLA to be very fair, and actually better than The Planet's, based on my experience hosting with Steadfast for many years now.

Posted by KarlZimmer, 09-13-2010, 06:29 PM
Quote:
Originally Posted by dariusf
One thing that might be useful is redirecting calls to an automated recording with the details of the outage, but then again, I would prefer the staff be busy resolving the issue rather than updating every status channel.

I think it boils down to breaking the updates into smaller, more manageable chunks and including the rollback time in the maintenance window.
That does make sense and would probably save us time. We'll see what we can do to work out a system for our staff to be able to insert and update such a message. Thank you for the suggestion.

Posted by joshribakoff, 09-13-2010, 07:31 PM
Please fill me in on how downtime due to DNS could possibly be prevented. What does it matter how many DNS servers I use or what I set the TTL to? I can't force a user's ISP not to cache the records. 48 hours far-fetched? You're fooling yourself. Even Steadfast quoted me that estimate of 48 hours when I asked about this in the past, and even though I observe my own ISP following TTLs, I have conducted enough experiments and talked to enough of my customers to know that it truly does take up to 48 hours. Sad but true. If there's some way around this, let me know.

Posted by spaethco, 09-13-2010, 07:51 PM
Quote:
Originally Posted by joshribakoff
Please fill me in on how downtime due to DNS could possibly be prevented. What does it matter how many DNS servers I use or what I set the TTL to?
Well, clearly if your DNS servers are located on the network that's down, it's not going to work. The records at the gtld root (.com/.net) are set to 48 hours of cache time, so if you need to point to new DNS servers it will take a while for the old records to age out of caches.

Quote:
Originally Posted by joshribakoff
I can't force a user's ISP not to cache the records. 48 hours far-fetched? You're fooling yourself.
You set a lower TTL for the records you want to be available for DNS failover -- something like 180 seconds. The number of ISPs out there that ignore DNS TTLs is very close to 0. I'm basing that not only on my experience with my personal gear, but also on professional experience with our globally load-balanced member portals for one of the largest healthcare companies in North America.

Quote:
Originally Posted by joshribakoff
I have conducted enough experiments and talked to enough of my customers to know that it truly does take up to 48 hours.
Either you're doing something wrong in testing, or your collection of users is such a statistical anomaly that you should start playing the lottery. We observe a ~99.9% tracking rate within 15 minutes of DNS changes, according to session-count numbers we track when we move things around.
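
If you want to sanity-check what resolvers are actually being told, it's trivial to inspect the TTL a record advertises. A minimal sketch, assuming the third-party dnspython package (pip install dnspython), with example.com standing in for your own record:

Code:
# Minimal sketch: look up an A record and print the TTL in the answer,
# which is the ceiling on how long a well-behaved resolver caches it.
import dns.resolver  # third-party dnspython package

def record_ttl(name):
    answer = dns.resolver.resolve(name, "A")  # dnspython 2.x; 1.x used .query()
    return answer.rrset.ttl

print(record_ttl("example.com"))  # substitute the record you plan to fail over

A 180-second TTL there means a well-behaved resolver re-asks within three minutes of a change, which is where the couple-of-minutes figure comes from.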

Posted by joshribakoff, 09-13-2010, 08:00 PM
Quote:
Originally Posted by spaethco
your collection of users is such a statistical anomaly that you should start playing the lottery.
Hmm, well, on Friday I logged into my registrar and edited the 'A' record for a customer who couldn't reach one of my sites (my fault, I had it pointed at the wrong IP). After fixing it I could immediately ping the new IP, because I keep my TTL low like you say. However, this user was unable to access my site until Sunday evening, and he checked it every few hours. I did not ask where he was geographically located. Anyway, shortly after that customer wrote to let me know the DNS had propagated, my whole server went down due to this outage. I'd love to take what you wrote at face value, but it doesn't explain my users complaining. My users typically don't complain for no reason.

Also, Karl has offered me SLA compensation I feel very good about; however, I'm still unable to reconcile what went down. They rolled back to the old routers, which means they're going to attempt the whole upgrade again at a later date now? ...

I found sources that back up what I am saying. What sources do you have to the contrary? I can't post links, but search for "dns propagation" and click on the devshed link. It says that ISPs *do* in fact cache the records, sometimes for up to 72 hours.

Posted by spaethco, 09-13-2010, 08:11 PM
Quote:
Originally Posted by joshribakoff
Hmm well Friday I logged into my registrar and edited the 'A' record for a customer that couldn't reach one of my sites
'A' records at the gtld level (i.e., the records you set at your registrar) are set to 48 hours, as I stated above.

The key is to have your own distributed DNS servers already set up with the registrar; then you can modify records whose TTLs you control.
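
As a concrete illustration of that approach, here is a minimal failover sketch: keep a low-TTL A record on DNS servers you control, health-check the primary site, and repoint the record when it stops answering. The update_a_record() function is a hypothetical stand-in for whatever update API your DNS provider exposes, and all names and IPs are placeholders.

# Sketch: simple DNS failover monitor using a low-TTL A record.
import socket
import time

PRIMARY = "203.0.113.10"   # placeholder: primary datacenter IP
BACKUP = "198.51.100.10"   # placeholder: standby host on another network

def is_up(ip, port=80, timeout=5):
    # TCP health check: can we open a connection to ip:port?
    try:
        sock = socket.create_connection((ip, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False

def update_a_record(name, ip):
    # Hypothetical stand-in for your DNS provider's update API.
    print("would set %s A -> %s (TTL 180)" % (name, ip))

while True:
    target = PRIMARY if is_up(PRIMARY) else BACKUP
    update_a_record("www.example.com", target)  # placeholder domain
    time.sleep(60)  # with a 180s TTL, resolvers follow within minutes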

Quote:
Originally Posted by joshribakoff
I found sources that back up what I am saying. What sources do you have to the contrary?
My source is actually implementing and operating DNS failover solutions on the production Internet.

Posted by dariusf, 09-13-2010, 09:01 PM
Quote:
Originally Posted by joshribakoff
Hmm well Friday I logged into my registrar and edited the 'A' record for a customer that couldn't reach one of my sites (my fault, had it pointed to the wrong IP). After fixing it I immediately could ping the new IP, because I keep my TTL low like you say. However, this user was unable to access my site until Sunday evening, he checked it every few hours. I did not ask where he was geographically located. Anyways shortly after that customer wrote me to let me know the DNS had propagated, my whole server went down due to this outage. I'd love to take what you wrote at face value but that doesn't explain my users complaining. My users typically don't complain for no reason.
You can also pay for a DNS failover service or, like spaethco mentioned, host your own DNS on two different networks.

Here are some threads on DNS failover:

webhostingtalk.com/showthread.php?t=524788

webhostingtalk.com/showthread.php?t=574218

Posted by ManagerJosh, 09-13-2010, 10:50 PM
For those of you who have not received the report, Karl posted it earlier today. It is available to read at https://support.steadfast.net/index....ews&newsid=286 or you may read it below.

Quote:
Originally Posted by KarlZimmer
This is being sent to the primary address on the account. If anyone else in your company or department needs this information, please forward it to them.

First of all, I am very sorry. I cannot say that enough; I truly apologize for the damage the outage has done to your business, the calls you had to receive, and the irate customers you had to handle. We fully understand the severity of this situation and that this negatively affects your business, as providing reliable connectivity and data center services is the core of our business. These types of events also damage our own business. We fully understand the need for reliability and feel that our stellar uptime performance up to this point is a testament to that. The upgrade was a significant investment on our part to assure that we maintain this stability and reliability as our network continues to grow. We have let you down and have not lived up to our promises. People come to us because of the reliability and level of service we provide, and in this case, we did not provide the service that you expect of us or that we expect of ourselves.

Simply put, we got in over our heads. We have completed similar migrations in the past without issue, including swapping out the Juniper router in New York for a Cisco, successfully completed in early July, and our previous swap from Juniper core gear to Cisco core gear in Chicago. After months of evaluating options, and weeks of having the hardware in hand for testing, we were confident we could perform such a migration again. This time, given the size of our customer base in Chicago, it was simply too large a change and too much of a risk. One of the many reasons customers like us is that we're a small company, a company that can be nimble to the needs of our customers, but in this case, we were too small to handle the demands of a migration such as this.

Several hours before the upgrade was to begin, we made sure that all of our in-house engineering staff was on-site for the maintenance, along with 3rd party network engineering for support and additional supervision. Things took a little longer to set up initially, which is why the initial window was extended to 7AM. At roughly 6:15AM everything was up and running on the new Brocade equipment: BGP sessions were all up, customer traffic was flowing, and everything looked great, with just some minor things left to touch up. Then, at 6:50AM, once we thought things were done and settled in, everything just collapsed. We still do not know what caused this collapse, but CPU load spiked across all of our core equipment, even the remaining Ciscos we had in place.

Since that configuration had already been working, we decided to push forward with resolving those issues, figuring it would be a relatively quick fix. There were multiple occasions where we seemed to have stabilized, and then, again, more issues. We eventually decided to revert to our old configuration and go with plan B. While reverting back to the old setup, we hit multiple problems right away: a failed chassis and management card, which immediately caused delay, and then the replacement management module also had hardware-related problems and needed to be replaced. Even with the known working configuration, we experienced various routing loops, routing issues, BGP convergence issues, etc. On top of all of this, there was a Cisco VTP issue, so we also needed to go around to all of our dozens of customer switches and manually reconfigure them, assuring that they had valid VLAN tables.

Stability has been returned and network operations are back to normal. If you have any network issues at this time, please contact us immediately.


We thought we had everything prepared and had spent weeks in configuration and testing, but it appears we were wrong. I don't need to tell you that things did not go as planned. During this time, we worked extensively with 3rd party engineers as well as both Cisco and Brocade engineers. We are not putting the blame on any specific gear or vendor; everything was a part of the problem, and we are responsible for it all. Mistakes were made, but we have certainly learned from them. Learning from this experience will make us a better company long term and will greatly affect how we see and plan things in the future. We need to assume that if things can go wrong, they will go wrong, even though we're normally a hopeful and optimistic bunch.

These changes will be implemented to assure these issues never happen again:
1) The new Brocade routers will not be used for the initially planned purpose. Instead, we will be investing in a new configuration, where we will completely separate the core and distribution layers of our network. This means there will be no changes to the current Cisco configuration, other than that we will at some point gradually and gracefully move individual BGP sessions over to the Brocades. These should be entirely non-invasive maintenances, just gracefully shutting down and moving BGP sessions.
2) We commit to building a network infrastructure and maintenance policy under which we will never have to force a widespread outage. The separate aggregation, distribution, core, and edge structure of the new network will greatly assist in that goal. This means the backup/roll-back gear will always be left in place, as-is, and transitions will be made slowly, over time. A maintenance spread out gradually over 6 months is a much better option than taking any risk of an event like this happening again. The primary objective in future planning is to mitigate the most risk.
3) We will send an email to customers about major maintenance windows, even though we have previously received many complaints about doing so. If you don't care about the maintenance, delete the email; it affects all of our customers and we want to assure everyone knows. We will continue to post the maintenances on the announcements page (https://support.steadfast.net/index.php?_m=news) as per our terms of service. All future announcements will include a maximum risk assessment, not an estimate of the actual downtime. We will assume the worst case, so you can take the actions necessary to prepare for it.
4) For future maintenance windows we will be bringing an extra staff member in specifically for managing communication, assuring the site and forums are kept up-to-date with as much information as possible. This will not be necessary 95% of the time, but we need to plan for the worst. Updates will be made regularly, even if there is little to no change.
5) We will be changing the structure of our network engineering department and management. All network engineering decisions will be solely made by network engineers, not by management or accountants.
6) Colocation customers can talk with our sales department (sales@steadfast.net) for free consulting and cross connects to reach our other bandwidth partners, to build a redundant multi-homed network configuration of your own. We even have no-commit pricing available from these partners, perfect for use as a redundant/backup link and during any future scheduled maintenance windows.

I know it may not be easy, but I am asking you to stick with us through these times. We have provided robust and friendly service up to this point; don't let this one incident, even though it was severe, destroy the quality business relationship we have together. If you help us through this time with your understanding, we can assure you it will pay long-term dividends. We have learned, and these mistakes will not be made again. Let's grow together. If you have any additional questions or comments, you can address them through our standard support channels or by contacting our management directly at management@steadfast.net.

Notes:
1) The InterNAP FCP was removed from the network, to prevent possible BGP issues due to it. It will be re-added within the next 48 hours, but you may have some sub-optimal routes until all the routes are updated through the InterNAP FCP.
2) We will honor our SLA. Instead of an SLA credit, we can provide free upgrades of RAM, memory, and bandwidth. These upgrades can easily equate to a much larger long-term benefit than a single credit, while it is also a short-term benefit to us.
3) If you have a Cisco switch, make sure that you have VTP set to transparent (vtp mode transparent) or properly configured. By default, the switch likely has VTP active with no authentication, meaning any switch you’re connected to can affect your VLAN table and potentially bring down your network.

Karl Zimmerman
Steadfast Networks
President/CEO
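
For anyone with more than a couple of switches to fix per note 3 above, the change can be scripted rather than applied by hand. Below is a minimal sketch, assuming the netmiko library; the management IPs and credentials are placeholders.

# Sketch: set VTP to transparent mode on a list of Cisco IOS switches.
# Requires the netmiko package (pip install netmiko).
from netmiko import ConnectHandler

SWITCHES = ["192.0.2.11", "192.0.2.12"]  # placeholder management IPs

for host in SWITCHES:
    conn = ConnectHandler(
        device_type="cisco_ios",
        host=host,
        username="admin",   # placeholder credentials
        password="secret",
    )
    conn.send_config_set(["vtp mode transparent"])
    # Confirm the switch now reports transparent operating mode.
    print(host, conn.send_command("show vtp status | include Operating Mode"))
    conn.disconnect()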

Posted by Steven, 09-13-2010, 11:18 PM
Quote:
2) We will honor our SLA. Instead of an SLA credit, we can provide free upgrades of RAM, memory, and bandwidth. These upgrades can easily equate to a much larger long-term benefit than a single credit, while it is also a short-term benefit to us.
How is that even a comparable SLA credit? Those upgrades do not help your customers in the short term with the SLAs they must pay out to their own customers. Remember, the SLA credits you would have to pay out are likely lower than what your customers pay out, as they may have many servers with lots of individual customers.

Posted by ManagerJosh, 09-13-2010, 11:54 PM
Quote:
Originally Posted by Steven
How is that even a comparable sla credit? Those upgrades do not help your customers in the short term for the SLA's they must pay out to their customers - Rememeber the sla's you would have to pay out are likely lower then your customers pay out as they may have many servers with lots of individual customers.
With all due respect, it's an option on the table for each customer, as each customer's requirements will vary. Some may find it a viable alternative and some may not.

Posted by misterd, 09-14-2010, 12:12 AM
Quote:
Originally Posted by KarlZimmer
3) We will send an email to customers about major maintenance windows, even though we have received many complaints previously about doing so. If you don’t care about the maintenance, delete the email, it affects all of our customers and we want to assure everyone knows. .......
I'm glad this is happening. Several providers that I'm also using will create a trouble ticket when there is going to be maintenance and a possible outage, and it's updated when there's something new. I'm not sure how Kayako and Ubersmith work, but if you can do something like that along with the announcement on Steadfast.net, it will be perfect.

Posted by BELLonline, 09-14-2010, 09:14 AM
I've been with Steadfast for 2 years now and their network had been almost faultless until this problem. These things happen: they made an upgrade and it went wrong, but they have clearly learned from what happened.

Posted by KarlZimmer, 09-14-2010, 01:50 PM
Quote:
Originally Posted by Steven
Just to confirm - curious about the new setup.

The Brocades will become core/BGP while the existing 6500's will be distribution?

Do you have any ETA on that maintenance?
Basically, yes. The Brocades will handle the core/BGP and the 6500's will just handle distribution for chi01 and chi02.

It will likely be 1-2 months out, as we need to install some additional cabs, power, etc., and it should be much, much simpler. The transition will involve moving over BGP sessions gradually.

Posted by superblade, 09-14-2010, 01:59 PM
Quote:
2) We will honor our SLA. Instead of an SLA credit, we can provide free upgrades of RAM, memory, and bandwidth. These upgrades can easily equate to a much larger long-term benefit than a single credit, while it is also a short-term benefit to us.

Quote:
Originally Posted by ManagerJosh
With all due respect, it's an option on the table for each customer, as each customer's requirements will vary. Some may find it a viable alternative and some may not.
I don't think it's an option for each customer. I asked for details on the free upgrades and was told it wasn't even an option for VPS clients. I guess I can understand this, but it wasn't clear in the statement that was sent out.

Posted by KarlZimmer, 09-14-2010, 02:20 PM
Quote:
Originally Posted by superblade
I don't think it's an option for each customer. I asked for details on the free upgrades and was told it wasn't even an option for VPS clients. I guess i can understand this, but it wasn't clear in the statement that was sent out.
That shouldn't be the case, PM me your ticket # and I'll see what I can do.

Posted by sirius, 09-14-2010, 04:29 PM
It appears that this issue is now resolved. Please feel free to use Steadfast's normal support channels for any further issues.
