steadfast.net / Servervina network is down! [MERGED]

Posted by jayzee, 07-19-2008, 05:14 AM
Hello, steadfast.net / Servervina network is down! I couldn't reach either one of them.
Posted by scooby2, 07-19-2008, 05:19 AM
Back up, but yes, the whole shebang was down for about 5 minutes.
Posted by ImLagging, 07-19-2008, 05:19 AM
My servers are going up and down. Hurray for the rollercoaster! They seem to be mostly up at the moment. Hasn't been more than 9 minutes.
Posted by JLHC, 07-19-2008, 05:19 AM
It is up from here, and http://www.downforeveryoneorjustme.com/steadfast.net says so too.
Posted by jayzee, 07-19-2008, 05:20 AM
It is back online. Still some packet loss.
Posted by VINAX, 07-19-2008, 06:04 AM
The network was down for a few mins and came back up. I don't see packet loss. BTW, we have changed ServerVina to a new name: VINAX
Posted by David, 07-19-2008, 03:09 PM
Oddly enough, didn't see anything on this end but I did have a complaint or two about it being offline for about 4 minutes.
Posted by Scott.Mc, 07-19-2008, 03:58 PM
Yeah, it was down around 10AM GMT, because I had just gotten to sleep and 3-4 minutes later http://www.admingeekz.com/hostdown.wav was going through all of the speakers. It did seem to recover about 4-5 minutes later.
Posted by qlites, 07-20-2008, 09:25 AM
They are down again, and were down for a few minutes earlier this morning. Seems that this is getting more frequent.
Posted by futurehosting, 07-20-2008, 09:27 AM
Yep. They said yesterday they had an issue with the FCP. Didn't even know they were using the FCP.
Posted by David, 07-20-2008, 09:34 AM
Wish they'd get this sorted, mmm!
Posted by Armageddon21, 07-20-2008, 09:35 AM
We've been completely dark on the value network for about 20 minutes so far. And 15 minutes before that, Level3 was completely down. It's the third time we have had major issues in the past ~30 hours. Hope it will get fixed for good!
Posted by Moopy, 07-20-2008, 09:42 AM
9:41 am ET time. Everything still down.
Posted by IPv6, 07-20-2008, 09:43 AM
Just timed out, came back, timed out. e: and my server's back, but rdp/rdc is horrifyingly slow e2: lol roller coaster
Posted by David, 07-20-2008, 09:45 AM
Seems back now. Edit: Scratch that.. offline.
Posted by Moopy, 07-20-2008, 09:46 AM
Same here... online for 2 mins, then offline again and online again now at 9:47am et and offline 9:50am et Last edited by Moopy; 07-20-2008 at 09:50 AM.
Posted by IPv6, 07-20-2008, 09:49 AM
there's nothing in the news section in support desk regarding this, strangely
Posted by hbouma, 07-20-2008, 09:50 AM
Probably because its been offline for them too. Hal
Posted by Scott.Mc, 07-20-2008, 09:52 AM
steadfast, gnax, fidelityaccess have all recovered now for me.
Posted by David, 07-20-2008, 10:08 AM
Out again?.. Off to a great start today.
Posted by Armageddon21, 07-20-2008, 10:09 AM
Down again, Wow this is getting real bad. Anyone got an update from them directly of what is going on?
Posted by Scott.Mc, 07-20-2008, 10:09 AM
It went off a few mins after it recovered for another few minutes, then it's gone again now.
Posted by Moopy, 07-20-2008, 10:15 AM
Down 10:13am EST fun fun fun up down up down up down up down up down.
Posted by ImLagging, 07-20-2008, 10:18 AM
My servers are up, but I seem to be getting some random problems with at least 2 of my servers (that I've noticed so far). I seem to be able to access everything just fine except for HTTP on one server and it seems to be sporadic.
Posted by marmoset01, 07-20-2008, 10:24 AM
The DC has been overrun by zombies. Steadfast is doing their best to eradicate the undead problem, please stand by. Pictures of the action: http://img241.imageshack.us/img241/4...zombieshy0.jpg Last edited by marmoset01; 07-20-2008 at 10:31 AM.
Posted by mixx941, 07-20-2008, 10:29 AM
One of our Premium network servers has been responding on the main IP all throughout the outage but unfortunately not on the addon IPs where the important services are. Another Premium server has been down on all IPs, even from inside Steadfast's network from the one server I can access. The one server that has been completely unreachable and the addon IPs on the other came back up around 8:55 AM CDT and have been coming and going over the last while (currently down). The issues started around 7:55 AM CDT. If your HTTP is on an addon IP then it might be a similar issue to the one we've noticed with addon IPs. Last edited by mixx941; 07-20-2008 at 10:33 AM.
Posted by RodrigoBR, 07-20-2008, 10:33 AM
My two servers (208.100.x.x) down here. Has anyone received Steadfast reply about the problem? Nothing for me. Last edited by RodrigoBR; 07-20-2008 at 10:44 AM.
Posted by ImLagging, 07-20-2008, 10:39 AM
I sure hope they had bought and read this book. http://files.myopera.com/CthulhuSave...ival_guide.jpg I didn't even think of checking the main IP since there's no site on it. The main IP works, but all addon IP's are down at the moment. I'm not even able to ping them anymore. For the other server, I seem to be having a DNS issue on one of my personal sites. It uses the main IP of the server and so far seems to be the only one that I can't resolve to an IP at the moment. hmm... Now that I know what the problem is and I'm looking into it, any of my customers that are on secondary IP's are down.
Posted by David, 07-20-2008, 10:41 AM
The response I got was "core3 issues", waiting for a more thorough one now..
Posted by IPv6, 07-20-2008, 10:42 AM
Lol, hm, a few mins ago my 67.202 IP is up and has packetloss, 208.100 is down completely..
Posted by marmoset01, 07-20-2008, 10:42 AM
I have.
Posted by RodrigoBR, 07-20-2008, 10:47 AM
Ok, thanks. Many downtime/instabilities here. Server UP, DOWN, UP, DOWN, UP, DOWN..
Posted by ImLagging, 07-20-2008, 10:51 AM
Steadfast has been stable for me for the most part this year. If I remember correctly there was an incident or two early this year, but since then, it's been smooth sailing. Until yesterday, that is. Hopefully this doesn't take much longer.
Posted by Armageddon21, 07-20-2008, 10:51 AM
This has been going on for almost 2 hours now. I was expecting a LOT more from them. Really disappointed
Posted by futurehosting, 07-20-2008, 10:54 AM
Yeah, that's what we got for the previous two issues over the last few days. Don't know about now though.
Posted by ImLagging, 07-20-2008, 11:05 AM
Things seem to be back to normal for the moment.
Posted by RodrigoBR, 07-20-2008, 11:05 AM
I am disappointed too, not because of the problem, since I know that Steadfast is doing its best to fix all issues and (like ImLagging said) Steadfast has been stable for me for the most part this year. But they closed my tickets (PKF-450447 and YEG-561784) with no reply, which is disrespectful to a long-time customer.
Posted by ImLagging, 07-20-2008, 11:08 AM
One thing I've noticed is that they also seem to be hard to get ahold of when problems do happen. So, we're basically left in the dark for the most part during the problem. However, during normal operations, they're great and almost always quick to respond.
Posted by RodrigoBR, 07-20-2008, 11:08 AM
Well, finally a reply to the new ticket I opened: ".... There was a network issue affecting our entire network, but it has since been resolved. ..." Not a complete reply, but at least it is something, hehe. Waiting to know if everything is really working 100% with no more downtimes.
Posted by David, 07-20-2008, 11:14 AM
What a response Rodrigo! Don't forget to thank Captain Obvious for me.
Posted by KarlZimmer, 07-20-2008, 11:42 AM
The issue yesterday and today has been a very large DDoS. After yesterday, we thought the issue was resolved once the IP being attacked was null-routed, but today things resumed on several different IPs for the same customer, and we have now been forced to ask that customer to leave. The attack today was of much greater magnitude, at many millions of packets per second. The sheer number of packets was simply overloading the routers, causing them to drop BGP sessions, etc. Part of the problem is that right now the network is a bit overly complex, partially because we're operating two separate networks, performance and standard, which forced this large number of packets through the same router multiple times. We are currently in the process of making major network changes to simplify the network, which will reduce the effects of these types of attacks and also speed up the time it takes to resolve them. These changes involve us going to a single network product and increasing the number and functionality of our core Cisco switches/routers. Also, Rodrigo, from what I can tell, all tickets had been responded to, but during the rush it seems both of your tickets were marked as duplicates, as they're almost exactly the same, instead of just one of them. I apologize for the oversight. The reason for the delay in explaining the outage is that we prefer to have a full and detailed explanation before making an announcement such as this. We also prefer to offer plans that are as detailed as possible for how we're going to prevent the issue from occurring in the future; this time it is something we had already planned to do. We do not like taking the risk of giving incorrect information (such as the bit about a problem with the FCP yesterday), which is generally what happens when information is released before our whole staff discusses it together, or incomplete information.
Last edited by KarlZimmer; 07-20-2008 at 11:51 AM.
Posted by Scott.Mc, 07-20-2008, 11:47 AM
Thanks for the response, but any sort of notice or acknowledgment would have been more helpful; instead, people have to second-guess on a forum. The intentions may be good, but it's not practical in these situations, which you've had quite a few of, so ultimately some sort of announcement system would be very helpful. Last question: how do we claim SLA credits?
Posted by KarlZimmer, 07-20-2008, 11:58 AM
As a note, basically this same announcement has been made on our forum. We are not trying to hide anything and do want our customers to have a full and complete explanation.
Posted by KarlZimmer, 07-20-2008, 12:12 PM
From our point of view, the network was down, there were issues, people knew that, what more is there to announce during the event? We would prefer that people be able to resolve the issue. What does it accomplish to have people on the phone saying, "We're having network issues, we don't know when things will be back to normal." You'd rather we dedicate our staff to answering the phone and working on giving customers information they already have, that the network is down/having issues instead of to resolving the issue? I know that you want details, but on a Sunday morning we can only get so many people in on short notice. Everyone we could get was there busy working on resolving the issue. I know you want details as soon as possible, but in some cases it is simply not feasible, but for all major outages we do make public announcements with full details, etc. Also, one DDoS event makes it so we've had "quite a few" issues? This is the first incident of the year, scheduled or not, where we have had any more than 10 minutes of downtime, and those events have not been frequent. Details on claiming SLA credits are detailed in the SLA itself.
Posted by David, 07-20-2008, 12:17 PM
Karl, while I'm not one to stand for allocating people to sit and spout out 'it's down, we're working on it, here's an eta' every two minutes -- a single location, off-network, where we can all view the existing status would be great. Updates could be posted there and we could all read them en masse rather than inundating your support team. Take dreamhoststatus.com as an example (albeit not a shining one, but an example nonetheless).
Posted by softtech, 07-20-2008, 12:34 PM
I think this thread and forum is off-network enough? And the best part is that one of the steadfast customers announced it first, and when it was resolved Karl made his comments known about the event. I think this is about as good as I would expect for the time being. Although it would have been better had nothing gone down in the first place. As for my company, we keep one server off our main backbone that hosts our website and email. I find it's always good to be able to keep up the web site and email even in the worst catastrophe. Especially since this sort of outage gets a lot of people to pay their bill who tend to be chronically late each month. There is an upside to every problem.
Posted by David, 07-20-2008, 12:36 PM
klb5, Webhostingtalk is not a proper communication medium. With a site dedicated to network status & updates, we could all happily add the rss feeds to our readers, phones, browsers and such -- and be up to date with any information steadfast knows, resulting in great communication instead of poor or no communication.
Posted by KarlZimmer, 07-20-2008, 01:13 PM
The issue is, you already knew what our tech support staff did, the network was down/having heavy packet loss. If you're checking the status page, you probably already know that, thus there is no new information gained, making the whole thing seem like a waste of time/effort. If we were able to have ETAs, etc. I can see some use, and if we ever do have an ETA we tell customers, etc. but the fact is, most issues are spending 98% of the time figuring out what the issue exactly is, then once it is figured out it is fixed faster than we'd be able to make a post on a page stating the ETA. The only benefit I can see in the whole thing is that the customers would know we're aware of the problem and are working on it, but that should be a given. What type of information would you realistically expect on that page? What do you see would be the goals and benefits? If I can see any benefit in it, I'd have no issue doing it, but as I said, I can't think of many cases where it would be much more than "There are network issues, we're looking into it." until we're able to discuss everything that happened afterwards and issue a full statement.
Posted by Scott.Mc, 07-20-2008, 01:26 PM
I think you've missed my point entirely. I already know you are working on it; that's why I never bother providers when they have issues. My point is it would have been nice for it to be acknowledged in some way. A status page (which I can't seem to find anything of) or a forum post (couldn't see one either) would have helped clarify, and you'd not have received as many tickets/calls from people. My other point was that you have had some issues that I can recall: a minor power issue (I believe this only affected specific customers, but it was still an issue we encountered), an issue at the tail end of last year too, yesterday morning, and then today. I also agree with your ETA part, they are not plausible in most cases, but a simple "we are working on it" goes a long way.
Posted by KarlZimmer, 07-20-2008, 02:28 PM
From my point of view, a "We're working on it." is basically meaningless, as well, that can be assumed, but if there is that much interest in it, I'll see what I can do to get something worked up for that. I'm thinking it might work best to just have a page showing the live status of most of our customer aggregation switches and possibly our BGP sessions, so that you could see the extent of an outage, etc. as well. Something like that would be more useful than a "We're having issues" page and I can see some utility in that.
Posted by Scott.Mc, 07-20-2008, 02:40 PM
Anything is better than absolutely nothing at all and if you can't see that. If people assumed then this thread would not exist.
Posted by Mekhu, 07-20-2008, 03:01 PM
Karl, Scott is dead on with this request. I can't see anyone complaining if something like this went online. Example:
12:08pm EST: We're working on it
12:59pm EST: Things are looking good but no ETA still
2:54pm EST: We're back online
Even something that simple would do wonders for easing our clients' questions/concerns when downtime does hit.
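As an illustration of how little is needed for the kind of timestamped log described in this post, here is a minimal Python sketch. The `StatusUpdate` type and `render_status_log` function are hypothetical names made up for this example; nothing here reflects any tooling Steadfast actually used, and the timestamps are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class StatusUpdate:
    """One short, timestamped entry on a (hypothetical) status page."""
    posted_at: datetime
    message: str


def render_status_log(updates):
    """Return the updates as plain text, oldest first, one per line."""
    ordered = sorted(updates, key=lambda u: u.posted_at)
    return "\n".join(
        f"{u.posted_at:%Y-%m-%d %H:%M}: {u.message}" for u in ordered
    )


# Placeholder entries mirroring the three-line example in the post.
log = render_status_log([
    StatusUpdate(datetime(2008, 7, 20, 12, 59), "Things are looking good but no ETA still"),
    StatusUpdate(datetime(2008, 7, 20, 12, 8), "We're working on it"),
    StatusUpdate(datetime(2008, 7, 20, 14, 54), "We're back online"),
])
print(log)
```

The output is plain text on purpose: it can be served from any host outside the affected network, which is the property several posters ask for.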
Posted by Popsikle, 07-20-2008, 03:49 PM
Don't your VPSes offer DDoS protection already? Why the whole network upgrade if you already have DDoS protection?
Posted by layer0, 07-20-2008, 05:16 PM
There are many forms/methods of protection, I'm sure this is just an upgrade to address more advanced attacks, or attacks that simply weren't covered as well before. This is pretty natural progression IMO.
Posted by NickCatal, 07-20-2008, 05:31 PM
Karl posted some more details on The Steadfast.net Forums
Posted by Scott.Mc, 07-20-2008, 06:42 PM
Looks like it's down again.
Posted by Littleoak, 07-20-2008, 06:43 PM
Yes, our servers are inaccessible.
Posted by Armageddon21, 07-20-2008, 06:44 PM
This one's gonna hurt bad, it's our highest peak of the week, Sunday night. ...no comments.
Posted by RodrigoBR, 07-20-2008, 06:45 PM
Karl, OK, but next time please reply to my tickets instead of closing them. I opened 2 tickets because in one of them I forgot to include information about the servers.
Posted by RodrigoBR, 07-20-2008, 06:48 PM
Oops, really all down again. :/
Posted by IPv6, 07-20-2008, 06:50 PM
Dropped at 06:41:33 eastern here. (lol, i was doing something with timestamps at that time on the box)
Posted by Apolo, 07-20-2008, 06:54 PM
Down here as well.
14 218 ms 215 ms 228 ms tbr1.cgcil.ip.att.net [12.122.17.134]
15 180 ms 177 ms 176 ms 12.122.99.25
16 * * *
17 * * *
Posted by IPv6, 07-20-2008, 06:56 PM
On the bright side, I'll be getting rich this month.
Posted by KarlZimmer, 07-20-2008, 06:59 PM
It seems to be the same issue yet again, on initial review.
Posted by Mekhu, 07-20-2008, 07:00 PM
Thanks for the update, Karl. I'm curious why this is only affecting a specific portion of our servers even though both are on the Premium Network. Not that I'm complaining. Feel free to PM me.
Posted by blackstone, 07-20-2008, 07:01 PM
What industry is the problem customer in?
Posted by geedeedee, 07-20-2008, 07:02 PM
I thought the problem customer was already terminated?
Posted by IPv6, 07-20-2008, 07:04 PM
doesn't mean packets will stop
Posted by bmhrules, 07-20-2008, 07:04 PM
I just hope it isn't my box causing these problems. T_T
Posted by KarlZimmer, 07-20-2008, 07:06 PM
As with the other issues, the issue is that the DDoS is just so many packets it is simply over-running our core switches. With the core switches in this state it is simply causing issues in getting into that hardware to properly diagnose and resolve the issue. We are working on it though.
Posted by KarlZimmer, 07-20-2008, 07:09 PM
The customer had been asked to leave and said their systems would be turned off, as the data had been relocated to another data center. We had noted all their systems had been turned off, though we gave them until the end of their billing cycle until they needed more data. It seems that they had turned one of the systems back on and/or the attackers are just attacking what were their IPs, which will still hit our network. The customer hosts various sites, forums, etc. and has not been an issue previously, but the scope of these attacks has been beyond what we have normally seen.
Posted by Branzone, 07-20-2008, 07:09 PM
Is there a status page or something where problems are posted? During this type of problem your support area is not accessible. How are we supposed to find out what's wrong or when it will be fixed? By posting here?
Posted by Mekhu, 07-20-2008, 07:11 PM
Thanks for the honest updates. That's all we can ask for in a situation like this.
Posted by hbouma, 07-20-2008, 07:12 PM
Karl's only said they've told the customer to leave. It doesn't mean they've actually been removed off of the network yet. (At least on WHT. I don't know what was provided on Steadfast's forums yet). Hal Update: looks like he's clarified this while I typed this in. Last edited by hbouma; 07-20-2008 at 07:15 PM.
Posted by blackstone, 07-20-2008, 07:12 PM
This data that you are providing is exactly what those above were saying would be ideal for a status page. Even these few sentences keep us up-to-date and let us know that progress is being made. Without updates it's a guessing game on what the issue is and when it could be resolved. You could copy and paste your posts (complete with timestamps) from this forum into a simple status page and that would probably make everyone more happy.
Posted by futurehosting, 07-20-2008, 07:14 PM
Agreed - and Karl, think about what would happen in the event of a complete disaster. You would be better served having an off-network status page that all your customers know about. I doubt every one of your customers comes from WHT and thus they would be served well by having some sort of notice system. Just something to think about.
Posted by KarlZimmer, 07-20-2008, 07:20 PM
As a note, I had already said earlier on that we'd be working on a status page with the status of various aggregation switches, etc. so that you can easily see the scope of issues and get updates, etc.
Posted by blackstone, 07-20-2008, 07:27 PM
That sounds wonderful. Any ETA on when this issue will be resolved?
Posted by Armageddon21, 07-20-2008, 07:28 PM
What about an update with an honest ETA? Its already been a while now. Thank you!
Posted by DeltaAnime, 07-20-2008, 07:29 PM
Just let them work :-) I'm sure at this point they're trying to find a perm solution. Thanks, ~Francisco
Posted by ScreamNet, 07-20-2008, 07:30 PM
First off, I do not want this to appear as a flame, but all you guys who keep saying "Reply to our tickets" are a bunch of bloody crazies. The techs are doing their best to bring us back online and you want them to sit and type "We're experiencing a problem, will be back online soon" a million times? I do not think so. I am currently offline (in kib5's rack) and I am no less pleased with their network than I was before the outages began. SteadFast has not had any outages to my knowledge since we moved to them in November of last year. Karl has done a great job of keeping our network secure and preventing any major catastrophe from happening in the operation of VerityNet and ScreamNet and all other companies on kib5's rack, as well as the operation of Frantech... We all feel the burn of this outage, but we as customers cannot do anything about it, and sending in tickets will ONLY cause longer downtimes. SteadFast has an SLA for this reason; claim your SLA when they come back online and stop bugging them right now. NOTICE: I AM IN NO WAY AFFILIATED WITH FRANTECH OR VERITYNET OR STEADFAST!
Posted by Mekhu, 07-20-2008, 07:38 PM
Karl, please check your PM's. I have no idea if it's a waste of time or not but it may help.
Posted by David, 07-20-2008, 07:42 PM
*sighs* This is exceptional.
Posted by KarlZimmer, 07-20-2008, 07:49 PM
We think we've found the issue and things should start getting better shortly, but there aren't any guarantees. It is taking some time for the null-route to propagate, etc. This attack was even larger than this morning's, probably the largest we've ever seen...
Posted by Mekhu, 07-20-2008, 07:50 PM
We'll report how it goes.
Posted by Dougy, 07-20-2008, 07:50 PM
How big? That must have been massive..
Posted by rulereric, 07-20-2008, 07:52 PM
I hope he does tell us how big, but usually they won't because it gives info to the person attacking.
Posted by KarlZimmer, 07-20-2008, 07:53 PM
Also, to update, we are planning to rush the network we had been planning to have completed in October and will have people flying in to try to complete it this month, if possible. That will decrease the complexity of our network, making the effects of such attacks significantly smaller, as it will be balanced over more equipment, and overall making these attacks easier to diagnose.
Posted by David, 07-20-2008, 07:56 PM
Yay, so Captain Ray will be back in for a few days. Now if only he'd stick around so we could return to flawless
Posted by dkitchen, 07-20-2008, 08:00 PM
Karl, why not announce your allocation as /24s, minus the block that's being attacked? That way the vast majority of your network would be back online and any DDoS traffic simply wouldn't get anywhere near your network... Dan
Posted by KarlZimmer, 07-20-2008, 08:01 PM
The issue right now is getting anything done on the network with the CPU on the core switches being completely overloaded. We had entered the null route for that block awhile ago, but it is taking awhile for it to be processed, etc.
Posted by ScreamNet, 07-20-2008, 08:03 PM
Karl- Why do you not just reboot the router? Chris
Posted by blahrus, 07-20-2008, 08:06 PM
It might get a new IP that way, and that would help stuff out.
Posted by ScreamNet, 07-20-2008, 08:07 PM
blahrus: The router is configured to use static ip addresses. They do not get "new ips" when it is restarted. I was suggesting to kill the router long enough for the flood to stop and bring it back online with the victim IPs null-routed.
Posted by blahrus, 07-20-2008, 08:08 PM
I am very familiar with their setup. Last edited by blahrus; 07-20-2008 at 08:09 PM. Reason: no need to get that smart.
Posted by streaky, 07-20-2008, 08:10 PM
There was me assuming that was a joke..
Posted by ScreamNet, 07-20-2008, 08:10 PM
i was not notified by email about your joke before i posted making myself look like an asshate -removed-
Posted by dkitchen, 07-20-2008, 08:11 PM
Null route, or have you excluded it from BGP announcements? A null route is of little use in this scenario, as the traffic is still hitting your equipment / service provider equipment. We have been hit like this a number of times, and the answer is to not announce the netblock, by announcing your /16 or whatever as individual /24s.
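The deaggregation dkitchen describes can be sketched with Python's standard `ipaddress` module. This is only an illustration of the arithmetic: the `10.0.0.0/16` allocation and the attacked `10.0.42.0/24` block are placeholder values, not Steadfast's real prefixes, and `deaggregate` is a hypothetical helper name.

```python
import ipaddress


def deaggregate(allocation, attacked, new_prefix=24):
    """Split `allocation` into /new_prefix subnets, dropping any that
    overlap the attacked block, so only the clean ones get announced."""
    alloc = ipaddress.ip_network(allocation)
    target = ipaddress.ip_network(attacked)
    return [
        subnet
        for subnet in alloc.subnets(new_prefix=new_prefix)
        if not subnet.overlaps(target)
    ]


# Placeholder prefixes (assumptions for the example only).
to_announce = deaggregate("10.0.0.0/16", "10.0.42.0/24")
print(len(to_announce))   # 255 of the 256 /24s remain
print(to_announce[:2])    # first two /24s to announce
```

Note that this only keeps attack traffic away if the covering /16 announcement is withdrawn as well; otherwise traffic for the missing /24 still follows the less specific route.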
Posted by blahrus, 07-20-2008, 08:14 PM
Yea, but l3 is going to need to be called directly to make sure they get it taken care of. ^^^^ that was a joke.
Posted by rapturetrumpet, 07-20-2008, 08:24 PM
Any new updates yet? I currently have 4 servers there and didn't get any response from steadfast yet.
Posted by scooby2, 07-20-2008, 08:25 PM
dkitchen is correct. It is ugly but it is the best way to mitigate the attack if your upstreams cannot do anything. The usefulness of null routing went out the window when DDOS started becoming popular. Get enough cable or dsl connected zombies and you can bring anyone to their knees.
Posted by Armageddon21, 07-20-2008, 08:26 PM
I am no expert in routers, but can't you unplug the uplink, so that no more of the attack reaches the router at all? Then you can change all the settings you want and replug the feed. Maybe you can't, it's just an idea.
Posted by silver_2000, 07-20-2008, 08:26 PM
I'm another customer whose ticket was simply closed. An hour or more of downtime this morning for 2 servers, and now going on a couple of hours of downtime this evening. Losing money and customers every 15 min.
Posted by deadly twin, 07-20-2008, 08:26 PM
do we know an estimate of size of ddos bandwidth attack?
Posted by KarlZimmer, 07-20-2008, 08:28 PM
Well, it would be a BGP null-route, broadcast to all of our carriers, preventing it from reaching our network, not just null-routing it once it gets to our network.
Posted by IPv6, 07-20-2008, 08:29 PM
the cool kids are using 100/1000mbit unix bots
Posted by JohnForsythe, 07-20-2008, 08:30 PM
Amazon S3 dies for the entire day, taking out hundreds/thousands of websites. And now the entire Steadfast network seems to be down. Interesting coincidence.
Posted by ScreamNet, 07-20-2008, 08:31 PM
WooHoo! WTG Karl! KarlZimmerman.youaremighty.com Woot! Btw, SteadFast is back online!
Posted by RodrigoBR, 07-20-2008, 08:33 PM
My servers are UP again. Thanks.
Posted by KarlZimmer, 07-20-2008, 08:33 PM
OK, seems the router was finally able to process the null route and it has been broadcast, things seem to be returning to normal.
Posted by sHuKKo, 07-20-2008, 08:37 PM
Karl, according to this you are using RioRey devices for DDoS protection. Just because of your very positive posts about RioRey, I contacted them about a month ago, and after long negotiations I am going to pay them for a gigabit unit tomorrow. So I guess I must think again now. I know how you feel now that your network is totally down. Believe me, I know it, because my small network was also down for the whole of Saturday because of an overloaded router dropping BGP connections, and there is nothing I can do to prevent it. Please tell me, Karl, what should I do now? You were my RioRey hero... Now, just as I am preparing to pay them, this happened to you... Should I take the risk and pay for the unit to see how it works for my network? Or totally drop them??? Please tell me...
Posted by KarlZimmer, 07-20-2008, 08:38 PM
Just to note, this attack was larger than the one this morning and was aimed directly at our core switches, not the same customer, though the attack itself seemed to be similar in profile. Filtering the remainder of the customer's IP blocks had not accomplished anything, and during the period of high utilization it took a long time to run any of the commands to discover exactly where the issue was. This appears to have been a malicious attempt aimed directly at our network, though at this point, I would be unsure as to the reason for the attack.
Posted by KarlZimmer, 07-20-2008, 08:40 PM
The RioRey's are only rated for a certain number of packets per second. The number of packets we were seeing far exceeded the limits of the devices, thus the attacks got through. With the new network changes we are planning to upgrade to their higher capacity 10 GigE units, which should be released shortly.
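To see why a packets-per-second rating matters more than raw bandwidth here, the worst-case packet rate of an Ethernet link can be worked out from standard framing overhead. The sketch below is a back-of-the-envelope calculation only: the 1 Gbit/s and 10 Gbit/s link speeds are example values, the actual attack size was never published in this thread, and `max_packets_per_second` is a hypothetical helper name.

```python
# Per-frame overhead on the wire, in bytes, on top of the Ethernet frame:
PREAMBLE_BYTES = 8          # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # mandatory idle time between frames


def max_packets_per_second(link_bps, frame_bytes=64):
    """Upper bound on frames per second for a given link speed.

    64 bytes is the minimum Ethernet frame size, which is roughly what
    a flood of small packets such as bare TCP SYNs produces.
    """
    wire_bits = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits


gige = max_packets_per_second(1e9)       # about 1.49 million pps
ten_gige = max_packets_per_second(10e9)  # about 14.9 million pps
print(round(gige), round(ten_gige))
```

So a flood of minimum-size packets can reach millions of packets per second without saturating a 10 GigE uplink, which is consistent with a device being overwhelmed by packet count rather than by bandwidth.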
Posted by Armageddon21, 07-20-2008, 08:41 PM
We're doing the yo-yo again
Posted by scooby2, 07-20-2008, 08:43 PM
Easier to block if fewer machines are attacking, but even if they are just attacking a single IP or a single customer, dropping the /24 can help (though some customers get to take one for the team until it stops). Either way I'm sure Karl and crew are on top of it!
Posted by ScreamNet, 07-20-2008, 08:44 PM
We're all still online hopefully it stays like that Karl, just hunt em down, get you a big stick and give em a good whack
Posted by KarlZimmer, 07-20-2008, 08:46 PM
There is still risk for some short blips and/or packet loss as the load on the core switches is still quite high because of the high amount of filtering and rate limiting, etc.
Posted by sHuKKo, 07-20-2008, 08:53 PM
Karl, can you please tell me what kind of attack this is? In my situation it's a spoofed SYN flood much bigger than my pipe, and the attack is not aimed at any one host but directly at my network. Thousands of randomly spoofed IPs are sending millions of TCP SYN packets to random addresses on a /24 that is mostly unpopulated but used mainly for my NOC operations. The attacking IPs were random and spoofed, the attacked IPs in the /24 were random and changing every second or two, and the attacked ports were also changing every second or two. So it is impossible to null route. The only thing I can do is drop the entire /24, and then 2 mins later the attack starts on the other /24 I have... Karl, can you please tell me whether your attack patterns are similar to this?
Posted by Senad, 07-20-2008, 08:58 PM
The Cisco Catalyst 6500 Series can handle 320 Gbps of throughput (40 Gbps per slot); however, the DDoS was most likely caused by SYN packets overflowing buffers, as these attacks usually are. Adding capacity only adds more throughput for the attacker, which in turn will be used against your network equipment. Since this is a core router setup, shouldn't you be able to balance it out and null route via BGP accordingly? Or, preferably, null route via BGP on one router and drop packets locally to ensure the buffer is drained faster of the bad traffic (so you can send out that BGP timeout earlier). RioReys are also based on a mathematical formula instead of doing deep packet inspection like TippingPoint, so they are not going to be as accurate as a deep packet inspection solution.
Posted by deadly twin, 07-20-2008, 09:12 PM
it looks like we are in the clear? can we get a confirm karl z?
Posted by KarlZimmer, 07-20-2008, 09:27 PM
I can say it is similar, but not exactly the case. Overall yes, some of the aspects you listed did occur, making it difficult to filter properly. And Senad, by adding more capacity I meant more capacity to handle such attacks: the new 10 GigE RioRey modules (the attack was just too many packets for the RioRey devices to handle properly) and additional core routers to spread out the load.
Posted by 1h3a3c7k, 07-20-2008, 10:59 PM
Does anyone have exact times that the network was down, or what percentage under the SLA we are entitled to receive? I don't have "perfect" logs for anything related to uptime, but my best guess is downtime was around 2 hours.
Posted by Scott.Mc, 07-20-2008, 11:01 PM
Basically if it's down for 100 minutes you are entitled to 100% refund and it was down for a total of more than 100 minutes.
Posted by silver_2000, 07-20-2008, 11:08 PM
Best guess is about an hour in the AM starting about 8 CST and about 2 hours in the PM starting about 4 CST. The posts here, and the posts or lack of posts on the forums on my servers, should narrow the time down further. I'm sure that Steadfast has detailed records of the times and they will take the appropriate steps. Doug
Posted by Chrysalis, 07-20-2008, 11:15 PM
dont know about 5 minutes but my server has been up and down at least 3 or 4 times and sometimes down for a good 30 minutes.
Posted by chopsmidi, 07-20-2008, 11:54 PM
My records are attached.
Posted by epic59, 07-20-2008, 11:55 PM
According to my syslog packets being transmitted from there, 118 minutes and 35 seconds (give or take 5 seconds, as I only get a heartbeat keepalive every 5 seconds).
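For anyone wanting to reproduce this kind of figure from their own logs, here is a minimal sketch of the arithmetic, assuming you already have the heartbeat arrival times as Python datetimes. The function name `estimate_downtime` and the sample timestamps are made up for illustration; only the 5-second interval comes from the post above. Each gap longer than the expected interval counts as downtime, and the result is only accurate to within one interval per outage.

```python
from datetime import datetime, timedelta


def estimate_downtime(heartbeats, interval=timedelta(seconds=5)):
    """Sum the time missing between consecutive heartbeats.

    A gap equal to `interval` is normal; anything beyond it is counted
    as downtime. The estimate is accurate to +/- one interval per gap.
    """
    total = timedelta(0)
    ordered = sorted(heartbeats)
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap > interval:
            total += gap - interval
    return total


# Hypothetical timestamps: two normal beats, one long silence, then recovery.
start = datetime(2008, 7, 20, 16, 0, 0)
beats = [
    start,
    start + timedelta(seconds=5),
    start + timedelta(seconds=7125),
    start + timedelta(seconds=7130),
]
downtime = estimate_downtime(beats)
print(downtime)
```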
Posted by GsX GrimReaper, 07-21-2008, 12:36 AM
Now that it's over, and I've read the entire thread, my comments are probably moot, but here goes. I switched to Steadfast due to comments read on these forums long ago, and I've had issues here and there, mostly faulty components, and nothing the fault of Steadfast. PC parts break, people. Same goes for attacks and outages not being the fault of the provider. I would prefer people be busy working on stuff than answering the phones. Hell, I couldn't even get a call to go through, and the entire steadfast.net site was down, so I couldn't have checked a status page either. I do have some minor suggestions along with the stuff you already mentioned will be happening soon.
1. OFF-SITE your entire site. Either on a totally different network, or preferably city. If the site goes down, it's not an issue that affects customers, and vice-versa. Yeah, I realize it's funny for a provider to pay someone else for webspace in another city, but until you expand, it's not gonna do any good to create info pages and private forums if they go down with the network. This is why I host my site from a friend on Dallas networks when my dedicated box is in Chicago.
2. Manage.Steadfast.net system should have a link to a status page as mentioned with the blah, blah, over-my-head info and indicators, but put a simple blog on the side for RECENT PROBLEMS. You can send data to blogs via cell phones these days, or so I hear, since I can't even receive pics and sound on mine. Extremely simple info like time/date network down - working on it. Update a public area later as an announcement like it's been forever.
3. PLEASE put the Community forums link on the top of the site with the others. As I'm typing this, and talking to a friend, he informed me you had forums, 'cause I sure never saw that little writing amongst the large headers and stuff I really look at. BTW, nothing on the page about any of these attacks in the News and Blog Updates section. Instead I found this thread via Google for Steadfast down.
That's all I got. As always, confident things are getting done. Thanks. BTW, when processing SLA for all these people bitching, process mine, make the check out for yourself, and go grab a cold one or two for the gang, 'cause y'all deserve it.
Posted by IPv6, 07-21-2008, 12:59 AM
hm, down for anyone else?
Posted by bmhrules, 07-21-2008, 01:00 AM
And here I was hoping I would just get free rails for my overly large case.
Posted by layer0, 07-21-2008, 01:00 AM
All looks fine here. Towards the bottom left of the home page, check the latest threads on the forum - the first link here has info on the outage(s).
Posted by IPv6, 07-21-2008, 01:04 AM
heh, rdp session just timed out. ipmi shows that it's up though, so no idea what's going on
Posted by jayzee, 07-21-2008, 05:27 AM
Check the time that I first created this thread. It has been happening since yesterday: one big DDoS followed by another big network issue today, on separate occasions, definitely more than 30 minutes.
Posted by Popsikle, 07-21-2008, 10:31 AM
Seems kinda silly that the "DDoS Protection" they advertise on the website hasn't helped them yet and they have to upgrade the whole network before it will help.... I wonder what they currently use [edit RioRey's were mentioned later in the thread], and if its too late for them to get the money back for it! Last edited by Popsikle; 07-21-2008 at 10:37 AM.
Posted by BostonGuru, 07-21-2008, 02:29 PM
I am sure their DDoS protection has helped a lot with smaller attacks. Think of it as a bank with a couple of guards. It will protect against thugs who show up with a knife and a money bag, but if 5 guys bust through the bank in an armored truck with automatic weapons, then there is not much that can be done.
Posted by Armageddon21, 07-21-2008, 08:38 PM
... Down and down. Please fix this MUCH faster than the other times.
Posted by Apolo, 07-21-2008, 08:41 PM
Yes, down again...
15 172 ms 272 ms 198 ms 12.122.99.25
16 177 ms 176 ms 193 ms 12.86.65.18
17 171 ms 176 ms 175 ms 216.86.149.61
18 * * *
Posted by VINAX, 07-21-2008, 08:47 PM
It looks like only the performance network is down. We are on the standard network, and it's up.
Posted by Armageddon21, 07-21-2008, 08:48 PM
We are on both, and both are down.
Posted by bhill, 07-21-2008, 08:49 PM
Anyone else unable to reach their servers hosted at Steadfast? Our servers have been unreachable for about 15 minutes now. Any information guys?
Posted by David, 07-21-2008, 08:52 PM
This is really beginning to undermine my bottom line.
Posted by ManagerJosh, 07-21-2008, 08:52 PM
We're already aware of the issue and Karl has a team working on it. I will post more information as I receive it from Karl.
Posted by VINAX, 07-21-2008, 08:53 PM
It's strange. All of our servers are online.
Posted by David, 07-21-2008, 08:56 PM
Seems chunks of the network are out and latency is eating me alive.
Posted by BostonGuru, 07-21-2008, 08:58 PM
The shared account I have with them is offline again as well, plus the nameservers for the domain I have parked there (dunno if it's the same machine).
Posted by ManagerJosh, 07-21-2008, 09:00 PM
A general update as of 7:58PM Central Time:
We're currently experiencing issues with one of our shared hosting switches. Repair efforts are underway to restore service to all affected customers.
Posted by David, 07-21-2008, 09:01 PM
If it's related to your shared hosting switches, any specific reason why it's affecting 65% of my machines? (Please avoid feeding ********, I'm still full from the weekend's events)
Posted by wavenumber, 07-21-2008, 09:01 PM
Yes, seems to be down again. This is getting annoying. I feel sorry for them. They haven't had a break in the last 3 days. Obs: shared hosting here Last edited by wavenumber; 07-21-2008 at 09:02 PM. Reason: incomplete
Posted by KarlZimmer, 07-21-2008, 09:02 PM
Most everything should be up. We've had a distribution switch just die, completely, no lights, anything. The only things that should actually be down are things directly connected to the distribution switch and much of our shared hosting services.
Posted by Armageddon21, 07-21-2008, 09:03 PM
Hello, we have a half rack and we're still completely down. Edit: Your phones are not working, your site is not working.
Posted by Scott.Mc, 07-21-2008, 09:08 PM
Rubbish. We are still having major problems. Outbound routes are essentially nonexistent.
1 (208.100.56.13) 0.437 ms 0.390 ms 0.406 ms
Posted by bhill, 07-21-2008, 09:11 PM
Anyone from Steadfast have a time estimate or more information as to what's going on?
Posted by KarlZimmer, 07-21-2008, 09:13 PM
Is that your server's IP? I'm not seeing any loss or issues to that IP.
Posted by ManagerJosh, 07-21-2008, 09:13 PM
I'm sorry you feel that way, and I understand your frustration from this weekend.
Posted by Scott.Mc, 07-21-2008, 09:14 PM
*slits wrists*
$ traceroute 206.251.72.26
traceroute to 206.251.72.26 (206.251.72.26), 30 hops max, 40 byte packets
1 (208.100.56.13) 0.437 ms 0.390 ms 0.406 ms
2 (216.86.149.60) 0.473 ms 0.473 ms *
3 (12.86.65.17) 0.607 ms 0.541 ms 0.577 ms
4 (12.122.99.86) 62.902 ms * *
5 (12.122.17.201) 62.523 ms 62.578 ms 62.526 ms
6 (12.122.4.121) 62.313 ms * *
7 * * *
8 * * *
9 * (12.122.104.5) 61.937 ms 61.826 ms
10 (12.116.103.34) 61.752 ms 61.879 ms 61.819 ms
11 (216.66.254.253) 64.495 ms 65.465 ms *
12 (206.251.72.26) 62.176 ms 62.519 ms 62.656 ms
Posted by Armageddon21, 07-21-2008, 09:17 PM
Karl, check PM. We're still completely down! We need this up ASAP.
Posted by Littleoak, 07-21-2008, 09:18 PM
All of our services are accessible at the moment. We haven't had any downtime tonight.
Posted by David, 07-21-2008, 09:19 PM
Just about everything in 208.100.x is completely dead in the water. Note both of your resolvers seem to be out as well or inaccessible via network.
Posted by Armageddon21, 07-21-2008, 09:20 PM
We're almost all on 208.100.X.X
Posted by Scott.Mc, 07-21-2008, 09:21 PM
Maybe I should add, this isn't "a server", this is everything; even your resolvers are down. Yet you don't know anything about it......
$ traceroute 216.86.146.8
traceroute to 216.86.146.8 (216.86.146.8), 30 hops max, 40 byte packets
1 (208.100.56.13) 0.493 ms 0.442 ms 0.429 ms
2 (216.86.149.60) 0.429 ms 0.437 ms *
3 * (216.86.149.61) 0.416 ms 0.441 ms
4 * * *
5 * * *
6 * * *
7 * * *
8 * * *
9 * * *
10 * * *
How about your resolvers from another location,
# traceroute 216.86.146.8
traceroute to 216.86.146.8 (216.86.146.8), 30 hops max, 40 byte packets
1 10.20.78.194 (10.20.78.194) 0.060 ms 0.056 ms 0.025 ms
2 po51.cer01.sea01.seattle-datacenter.com (67.228.118.133) 0.263 ms 0.316 ms 0.322 ms
3 * * *
4 * * *
5 12.118.34.17 (12.118.34.17) 0.791 ms 0.706 ms 0.718 ms
6 tbr1.st6wa.ip.att.net (12.127.6.193) 48.909 ms 48.856 ms *
7 * * *
8 cr1.cgcil.ip.att.net (12.122.31.161) 48.408 ms * *
9 * * *
10 12.122.99.25 (12.122.99.25) 48.152 ms 48.205 ms 48.456 ms
11 12.86.65.18 (12.86.65.18) 48.460 ms 126.593 ms 106.474 ms
12 216.86.149.61 (216.86.149.61) 59.712 ms 59.548 ms 59.737 ms
13 * * *
14 * * *
How about google?
$ traceroute 64.233.167.99
traceroute to 64.233.167.99 (64.233.167.99), 30 hops max, 40 byte packets
1 (208.100.56.13) 0.571 ms 0.422 ms 0.416 ms
2 (216.86.149.60) 0.406 ms 0.430 ms *
3 * (208.173.176.217) 0.635 ms 0.668 ms
4 (204.70.194.245) 0.751 ms 0.734 ms 0.675 ms
5 (208.174.224.10) 0.619 ms * *
6 * * *
7 * * (66.249.94.133) 1.579 ms
8 (64.233.175.42) 1.769 ms (72.14.232.74) 1.838 ms (64.233.175.26) 13.991 ms
9 (64.233.167.99) 1.946 ms 1.794 ms 1.870 ms
That's right, it's everyone else's problem. Not yours.
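When comparing several traces like the ones above, it can help to pull out the last hop that actually answered, since that is where the path goes dark. A rough sketch, assuming the trace is available as plain text in the usual one-hop-per-line layout; `last_responding_hop` is a hypothetical helper, and the sample is a shortened copy of the first trace in this post.

```python
import re

# A hop line starts with the hop number; the rest holds addresses, times, or "*".
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def last_responding_hop(traceroute_output):
    """Return (hop_number, ip) for the last hop that replied, or None."""
    last = None
    for line in traceroute_output.splitlines():
        match = HOP_LINE.match(line)
        if not match:
            continue  # header line or blank line
        addresses = IPV4.findall(match.group(2))
        if addresses:
            last = (int(match.group(1)), addresses[0])
    return last


sample = """traceroute to 216.86.146.8 (216.86.146.8), 30 hops max, 40 byte packets
 1 (208.100.56.13) 0.493 ms 0.442 ms 0.429 ms
 2 (216.86.149.60) 0.429 ms 0.437 ms *
 3 * (216.86.149.61) 0.416 ms 0.441 ms
 4 * * *
 5 * * *
"""
result = last_responding_hop(sample)
print(result)
```

Note that a silent hop is not proof of an outage on its own, since some routers simply do not answer probes; the last responding hop is only a starting point for where to look.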
Posted by dgarbus, 07-21-2008, 09:21 PM
Our boxes are online but we are unable to access any websites. A simple 'host' or 'traceroute' command for google.com (and many others) times out.
Posted by ub3r, 07-21-2008, 09:22 PM
This appears to be due to your system's nameservers not being able to query our DNS servers because both are hosted under the same shared switch. If you reconfigure /etc/resolv.conf to use the following:
nameserver 4.2.2.3
nameserver 4.2.2.4
it should be alright. As for those two ATT hops, I believe that may be standard ATT ICMP dropping; however, I could be wrong. Try tracing out to 4.2.2.3 to hit Level3's net, to verify connectivity.
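For anyone with several boxes to change, the resolver swap described above could be scripted. A minimal sketch that works on the file contents as a string, so nothing is written to disk; `replace_nameservers` is a hypothetical helper and the `search example.com` line is a placeholder, not taken from any real config in this thread.

```python
def replace_nameservers(conf_text, servers):
    """Return resolv.conf text with every nameserver line swapped for `servers`.

    All other lines (search, options, comments) are kept in their original order.
    """
    kept = [
        line
        for line in conf_text.splitlines()
        if not line.strip().startswith("nameserver")
    ]
    kept.extend(f"nameserver {ip}" for ip in servers)
    return "\n".join(kept) + "\n"


# Hypothetical starting config; 216.86.146.8 is the resolver named in this thread.
original = "search example.com\nnameserver 216.86.146.8\n"
updated = replace_nameservers(original, ["4.2.2.3", "4.2.2.4"])
print(updated, end="")
```

Keep a copy of the original file before writing the result back, so the change is easy to undo once the provider's resolvers return.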
Posted by Scott.Mc, 07-21-2008, 09:23 PM
Oh mikey, stop. Honestly, stop.
Posted by David, 07-21-2008, 09:23 PM
They're *your* resolvers and internal network routes that are screwed, not our local settings. 208.100.x is completely inaccessible within your network -- 65% of my systems are offline as a result, how can you not see this?
Posted by bhill, 07-21-2008, 09:24 PM
Our servers are still down.
[root@luv2spd ~]# ping chud-cm.nexcess.net
PING chud.com (208.100.12.90) 56(84) bytes of data.
--- chud.com ping statistics ---
92 packets transmitted, 0 received, 100% packet loss, time 90984ms
Posted by Apolo, 07-21-2008, 09:26 PM
Still down here.
14 214 ms 216 ms 214 ms tbr1.cgcil.ip.att.net [12.122.17.150]
15 175 ms 192 ms 177 ms 12.122.99.25
16 179 ms 178 ms 173 ms 12.86.65.18
17 175 ms 172 ms 177 ms ip61.216-86-149.static.steadfast.net [216.86.149.61]
18 * * *
19 * * *
And Steadfast.net is also down. And no, it is not only from my location: http://www.checkdns.net/quickcheck.a...net&detailed=1
Regards,
Posted by ub3r, 07-21-2008, 09:26 PM
Have you looked at your DNS resolver config, and attempted to query an off-network domain with dig?
dig google.com @216.86.146.8 - may not work
dig google.com @4.2.2.3 - should work.
[root@web5 ~]# dig google.com @4.2.2.4
; <<>> DiG 9.3.4-P1 <<>> google.com @4.2.2.4
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45035
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;google.com. IN A
;; ANSWER SECTION:
google.com. 211 IN A 72.14.207.99
google.com. 211 IN A 64.233.187.99
google.com. 211 IN A 64.233.167.99
;; Query time: 2 msec
;; SERVER: 4.2.2.4#53(4.2.2.4)
;; WHEN: Mon Jul 21 20:20:41 2008
;; MSG SIZE rcvd: 76
[root@web5 ~]# dig google.com @4.2.2.3
; <<>> DiG 9.3.4-P1 <<>> google.com @4.2.2.3
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27069
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;google.com. IN A
;; ANSWER SECTION:
google.com. 135 IN A 72.14.207.99
google.com. 135 IN A 64.233.167.99
google.com. 135 IN A 64.233.187.99
;; Query time: 1 msec
;; SERVER: 4.2.2.3#53(4.2.2.3)
;; WHEN: Mon Jul 21 20:25:27 2008
;; MSG SIZE rcvd: 76
web5 happens to be on a different switch. I know what I'm talking about, scotty
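The same "does this resolver answer at all" check can be done without dig. A minimal sketch using only the Python standard library: it hand-builds a single A-record query in the standard DNS wire format and sends it over UDP to a chosen resolver. The names `build_query` and `resolver_answers`, the fixed query ID, and the 2-second timeout are all assumptions for illustration, and the check only confirms that a reply came back; it does not parse the answer records.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a DNS query packet asking for the A record of `name`."""
    # Header: ID, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def resolver_answers(resolver_ip, name="google.com", timeout=2.0):
    """Return True if `resolver_ip` sends back a DNS response within `timeout`."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (resolver_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable-network errors
            return False
    # A valid reply echoes our query ID and has the response (QR) bit set.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)


packet = build_query("google.com")
print(len(packet))
```

Calling `resolver_answers("216.86.146.8")` and `resolver_answers("4.2.2.3")` from the affected machine would mirror the two dig commands above, with the caveat that a False result cannot distinguish a dead resolver from a broken path to it.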
Posted by Armageddon21, 07-21-2008, 09:26 PM
For us 100% down, not a single ping in the last hour. We've been down over 4 hours in the last 2 days.
Posted by Scott.Mc, 07-21-2008, 09:27 PM
No, you do not know what you are talking about. YOUR resolvers are down because YOUR network is not allowing connectivity in the 208.100 range. << removed >> Is it everyone else's problem and not yours? Last edited by writespeak; 07-21-2008 at 09:34 PM.
Posted by bhill, 07-21-2008, 09:33 PM
Whoa guys, a flame war isn't going to get the problem resolved any quicker. The 208.100.*.* range is still inaccessible. Any estimate on when this is going to be back up? Customers are getting fidgety...
Posted by Armageddon21, 07-21-2008, 09:35 PM
We need a real update and details on this issue. Your phones do not work, your web site is down, and there is no offsite status page. We're completely in the dark and it is extremely frustrating. Has anyone gotten hold of someone at Steadfast with more info?
Posted by JohnForsythe, 07-21-2008, 09:39 PM
Steadfast.net is down for me. My server, however, is up. Steadfast's DNS resolver seems to be down, though. I can't look up any domains with it.
Posted by David, 07-21-2008, 09:42 PM
And we're up, just beyond 65 minutes down for this evening.
Posted by magnify, 07-21-2008, 09:42 PM
I'm on shared hosting, everything is down for me. Email, Web, Mysql. I also can't access steadfast's main website.
Posted by jayzee, 07-21-2008, 09:47 PM
216.86.146.8 resolver is up for me.
Posted by VINAX, 07-21-2008, 09:48 PM
Looks like everything is back up.
Posted by Armageddon21, 07-21-2008, 09:49 PM
Still completely down here
Posted by Apolo, 07-21-2008, 09:49 PM
Now it's up, finally.
Posted by KarlZimmer, 07-21-2008, 09:51 PM
If you just came up now then the issue was likely with DNS resolvers, not with a network outage, as we just got our DNS resolvers up. Other than that there have been no network changes, though we're gradually moving individual clients who were single-homed on that distribution switch.
Posted by David, 07-21-2008, 09:52 PM
Karl, again, I can assure you that was not entirely related to the DNS resolvers. Your statement that it was 'DNS' resolvers is an outright lie -- how would DNS resolvers prevent us from accessing our IP space directly?!
Posted by layer0, 07-21-2008, 09:52 PM
Are you sure? Your own site was actually inaccessible...
Posted by David, 07-21-2008, 09:53 PM
As were at least half of our systems, obviously "resolvers" played a role in disallowing us to connect to the network outright! I feel like a broken record though, off to play a tune somewhere else.
Posted by Armageddon21, 07-21-2008, 09:58 PM
Any ETA for us? Check your PM
Posted by layer0, 07-21-2008, 09:59 PM
Yep, I just figured using their site as an example would be the easiest way to convey the point...
Posted by Apolo, 07-21-2008, 10:00 PM
Karl, I even posted a traceroute result on this very same thread and a link from a web tool, and also your main web site was down as well. I don't believe this was just a *simple* DNS resolvers issue... Regards,
Posted by bhill, 07-21-2008, 10:01 PM
All of our servers are still down as of 10PM. Any way I could get you on the phone for a second, Karl?
Posted by David, 07-21-2008, 10:01 PM
The point isn't getting across, oddly enough. :/ Case of the bodysnatchers?
Posted by ManagerJosh, 07-21-2008, 10:03 PM
If your servers are still down, please feel free to open a support ticket so someone can look at it. support.steadfast.net
Posted by Armageddon21, 07-21-2008, 10:10 PM
Ticket ID: VGZ-133084
Posted by KarlZimmer, 07-21-2008, 11:45 PM
I apologize for not giving more updates, but we've been re-running cables like crazy here....
Posted by Littleoak, 07-22-2008, 01:11 AM
SteadFast sent out an email explaining yesterday's problem. Thank you for the explanation, Karl, it's much appreciated.