

Netriplex down?




Posted by UH-Bobby, 12-15-2010, 05:04 PM
Hey all, It's appearing that Netriplex is down. I'm not able to reach any of my equipment nor their site. I'm currently calling them. Is anyone else down? Thanks!

Posted by Eleven2 Hosting, 12-15-2010, 05:04 PM
Yes, it is down here for me as well...

Posted by dmtinc, 12-15-2010, 05:05 PM
for me 2 ...

Posted by JTY, 12-15-2010, 05:05 PM
We're down as well. I'm trying to call them, but on hold.

Posted by eleven2jon, 12-15-2010, 05:05 PM
Yeah we are having the same issues.

Posted by Warpline, 12-15-2010, 05:05 PM
Yes, we're down as well.

Posted by linuxissues, 12-15-2010, 05:05 PM
Anyone know what's happening?

Posted by brc_csf, 12-15-2010, 05:08 PM
Everything seems to be down. All our servers, their site, etc.

Posted by UH-Bobby, 12-15-2010, 05:08 PM
We are also on hold.

Posted by Eleven2 Hosting, 12-15-2010, 05:11 PM
I have tried to call them but am getting 'Circuits are busy'.

Posted by UH-Bobby, 12-15-2010, 05:12 PM
Post already made in outages area.

Posted by linuxissues, 12-15-2010, 05:12 PM
Yup, same here.. I hope this doesn't take too long.. can't get them on the phone.

Posted by ishan, 12-15-2010, 05:13 PM
Down for us as well. Repeat of last time's incident? 1. Traceroutes don't leave the originating country. 2. The portal isn't accessible even after it was made redundant.

Posted by JTY, 12-15-2010, 05:13 PM
You're not missing much, just crappy hold music that is way too loud.

Posted by UH-Bobby, 12-15-2010, 05:14 PM
Just got off phone, the call center reps aren't sure what the problem is yet.

Posted by bhavicp, 12-15-2010, 05:14 PM
Down here as well. Maybe the portal was never made redundant =/ Though this REALLY should be done.

Posted by crventures, 12-15-2010, 05:17 PM
My VPS through 2Host (who seems to be using Netriplex) is down too. Traceroutes dying at static-98-140-159-37.dsl.cavtel.net (98.140.159.37)

Posted by UH-Bobby, 12-15-2010, 05:18 PM
Yeah seriously. I've got a backup community forum for stuff like this. It annoys me when I can't get through to them.

Posted by RyanD, 12-15-2010, 05:18 PM
They are entirely offline, even their peering session with us here in Atlanta is down.

Posted by JTY, 12-15-2010, 05:21 PM
Probably another failure with their "redundant" fiber.

Posted by UH-Bobby, 12-15-2010, 05:21 PM
We'll see if this is major, but this is the second outage in 2 months and 9 days.

Posted by pubcrawler, 12-15-2010, 05:22 PM
Traceroutes to Uber are really wacky. From our cable modem:

8 tge1-2.ncntoh1-swt402.neo.rr.com (24.164.111.241) 25.715 ms 25.927 ms 26.677 ms
9 gig13-0-0.ncntoh1-rtr2.neo.rr.com (24.164.104.93) 25.089 ms !H * *

It never leaves RoadRunner's network to connect up with Atlanta and over to Asheville. It appears Uber has an upstream provider down, or has routed themselves into a nonexistent black hole. Same story from another location in Texas; it never leaves the facility. Can't get to uberbandwidth.com or netriplex.com either.

Posted by ishan, 12-15-2010, 05:23 PM
Indeed. We keep all our corporate websites at Stormondemand for the same reason. Clients should be able to contact you ESPECIALLY when you are down

Posted by JTY, 12-15-2010, 05:23 PM
Yeah, all of their BGP sessions are down. So, no network knows how to reach them, hence the routing issues.

Posted by UH-Bobby, 12-15-2010, 05:26 PM
If someone finds the cause of this, will you post it? I don't feel like being on hold for 15 minutes again to see if the first line people know anything about the outage yet.

Posted by Eleven2 Hosting, 12-15-2010, 05:27 PM
I am trying. I am on hold and will let everyone know as soon as they pick up. Does anyone know if there are any other providers in the facility, not going over this 'redundant fiber', that you can buy access to?

Posted by brc_csf, 12-15-2010, 05:27 PM
It would be great to have an official update or ETA so we know if we are waiting 20 minutes or hours (days? only if there's a fire).

Posted by crventures, 12-15-2010, 05:28 PM
Well, this is probably the "kick in the pants" that I needed to finish getting my failover VPS ready.

Posted by Eleven2 Hosting, 12-15-2010, 05:29 PM
Wow. I just spoke to them. They do not know what is going on. So I asked: "Is it network or power?" Answer: "We do not know yet."

Posted by brc_csf, 12-15-2010, 05:33 PM
Based on our experience, every place goes down sooner or later. What matters most is how they handle this kind of issue, how fast they get back up, and how often issues like this happen (it should be very rare).

Posted by JTY, 12-15-2010, 05:35 PM
..... How can they not know if power is an issue or not?!?

Posted by UH-Bobby, 12-15-2010, 05:36 PM
I've been with Netriplex for about 10 months. I've been up until the outage on 10/4 and then this outage.

Posted by UH-Bobby, 12-15-2010, 05:37 PM
That must be sort of a canned response. I've been to the datacenter, they can see the cabinets from their operations center. Heck, they can walk out and see if it's power.

Posted by Eleven2 Hosting, 12-15-2010, 05:39 PM
Ya, they can see if the power is on. I understand their offices and lights and all might be on different circuits than their DC, but they should know very easily if the power is down.

Posted by brc_csf, 12-15-2010, 05:47 PM
They should have a network status page outside their network. Having no updates is really bad.

Posted by UH-Bobby, 12-15-2010, 05:49 PM
A basic page would be a piece of cake to do. We're over 45 minutes into this...

Posted by RyanD, 12-15-2010, 05:52 PM
Chances are their first line call desk is outsourced to an answering service, thus they have no answer.

Posted by UH-Bobby, 12-15-2010, 05:53 PM
It's possible.

Posted by bdwarr6, 12-15-2010, 05:53 PM
I called their sales department and was told that it is a network issue.

Posted by pubcrawler, 12-15-2010, 05:53 PM
I am about to pick my ball up and take it to another data center to play. You can't have a prime-time outage like this during holiday shopping and give such a crap response about the cause with no ETA. They don't know if it's the electric?!? So much for proactive monitoring and common sense. I don't understand how their own websites are down; they offer services at many other datacenters (20+), including Boston, which I keep seeing ads for. Making their own site redundant is a redundancy 101 thing. The last outage left me sitting here with a bad impression of them and their network, and cancelling the explanatory conference call about the matter wasn't a good decision on their part.

Posted by pubcrawler, 12-15-2010, 05:55 PM
From Twitter: itsgcorp @netriplex I got through to AVL01, major ATL link went down. And rerouting didn't happen.

Posted by RyanD, 12-15-2010, 05:59 PM
Both links are down; there are no routes in the global BGP table for them, they are completely offline. Pull up a looking glass from a carrier in Ashburn or Atlanta.
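
A rough way to script the looking-glass check described here is to ask a public route-collector service whether a prefix is still visible in the global table. A minimal Python sketch, assuming RIPEstat's public routing-status endpoint and an example prefix (neither is something anyone in the thread actually used; any carrier looking glass or route server would do):

# Rough sketch: ask a public route-collector API whether a prefix is still
# visible in the global BGP table. Assumes RIPEstat's routing-status endpoint
# (an assumption; a carrier looking glass or route server works just as well).
import json
import urllib.request

PREFIX = "67.23.161.0/24"  # example prefix only; substitute the one you care about

url = "https://stat.ripe.net/data/routing-status/data.json?resource=" + PREFIX
with urllib.request.urlopen(url, timeout=15) as resp:
    payload = json.load(resp)

# The API wraps results in a "data" object; dump it and eyeball the
# visibility/announcement details rather than assuming exact field names.
print(json.dumps(payload.get("data", {}), indent=2))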

Posted by crventures, 12-15-2010, 06:04 PM
From their twitter account "Netriplex DC has had a network outage and is working to resolve."

Posted by Edrick Smith, 12-15-2010, 06:12 PM
So it takes them over an hour to actually post an update? Not to mention it's still down.

Posted by sirius, 12-15-2010, 06:12 PM
REMINDER - This forum is provided to discuss current outage issues and to allow a way for customers and providers to communicate. Comments by non-customers will be removed. If you are not a current customer, there is no need for you to post; reviews are only suitable in the appropriate main forum category.

Posted by ssthormess, 12-15-2010, 06:13 PM
Any updates on this? On the phone they tell us nothing, in fact.

Posted by cmanns, 12-15-2010, 06:18 PM
No updates yet, mate.

Posted by pubcrawler, 12-15-2010, 06:23 PM
I am a customer, so pay attention. It's been over an hour now and no response from Netriplex/Uberbandwidth. To inform those who are unaware: the last outage, a bit over 2 months ago, was similarly mysterious and never explained by Netriplex/Uberbandwidth. A conference call scheduled after that event was cancelled without comment or explanation. An outage at 4PM on a Wednesday less than 10 days from Christmas is very bad timing on their part. Obviously there isn't redundant connectivity to this facility and everything has to be backhauled to Asheville by private fiber. They are making up unacceptable EXCUSES on the phone rather than just saying what this event is. An electric outage is very easy to detect and should be a monitored asset your control center is informed of when it fails. Let's all demand an explanation about this outage from Netriplex. Growl! Last edited by sirius; 12-15-2010 at 06:44 PM.

Posted by WebManagerNY, 12-15-2010, 06:25 PM
Yes, I am a customer as well. Let's all demand a conference call this time from Netriplex. Not the promise of a conference call, then postponed, then canceled.

Posted by pubcrawler, 12-15-2010, 06:26 PM
From Twitter:

NETRIPLEX (Netriplex LLC): Netriplex DC has had a network outage and is working to resolve. (22 minutes ago)

lisashea (Lisa Shea): ARRGGHHH I just sat on hold for 25 minutes with Netriplex to find out the status of this outage and I got disconnected! Have to start again!

avl-1.xenserv.net disruption: Upstream datacenter issue with BGP Sessions (Netriplex) twitter.com/NETRIPLEX - No...

ubservers (UBservers): Problem with our main DC netriplex. The whole DC is down: http://netriplex.com and our bandwidth provider as well: http://uberbandwidth.com (17 minutes ago)

Posted by dbatch, 12-15-2010, 06:34 PM
No point in even wasting your time. I talked to them and they have no idea of when it'll be back up. Last I heard it was 30 minutes about 45 minutes ago.

Posted by brc_csf, 12-15-2010, 06:35 PM
How does 100% SLA work ?

Posted by cyberdog, 12-15-2010, 06:39 PM
Wasn't mysterious at all. Check your email; they made a full report and released it internally. A local contractor made a big booboo with the redundant fiber links. They do have redundant links; it doesn't help if both are cut or if the BGP routing table is foobared (as it is now). Really annoys me they don't put the portal off network like we do, though. From what I heard, their fiber trunk to "their new" ATL facility went down, and somewhere along the way soon after, their routing table was foobared, sending packets every which way until their time to live runs out. Last edited by sirius; 12-15-2010 at 06:46 PM.

Posted by gone-afk, 12-15-2010, 06:40 PM
1. extraordinarily long outage, portal etc down despite promises from last time. 2. damage to your client base 3. small SLA credit to buy beer . . . . profit! 1 hour 50 minutes now.

Posted by cmanns, 12-15-2010, 06:41 PM
So you're saying it's power? I'm pretty sure it isn't. People were reporting downtime with Netriplex earlier, then it went down for me...

Posted by cmanns, 12-15-2010, 06:42 PM
100% sla is a lie bro

Posted by cyberdog, 12-15-2010, 06:46 PM
An SLA is just an agreement; the agreement is they'll credit you for ANY downtime. I'm going to hold them to that for this outage... I get a sneaking suspicion someone opened up the routing table in vi and didn't know the colon commands :-P. Just got off the phone with them; still no update, but I never got any false information about power. I think your question may have been vague.

Posted by dbatch, 12-15-2010, 06:46 PM
It's not power, it's connectivity. At least that's what the rep told me when I called.

Posted by pubcrawler, 12-15-2010, 06:47 PM
From DZone (apaganobeleno): "In 9 more minutes our ISP, Netriplex, will have reached 2 nines of downtime: 99 minutes DOWN!! OMG!!" (2 minutes ago)

Thanks cyberdog for the info. Seems like their redundant fiber has some crossover point of failure and too much human foul-up involved. For such a large facility with that much bandwidth, this sort of stuff should be better protected, documented and isolated. Only two pieces of fiber really scares me. Compare that to their menu of blended carriers and realize those carriers are back there in Atlanta, not Asheville, across that weaker fiber link. Shame; I think their hands-on folks are fairly competent when we've needed them. But the outages are starting to get too regular. Couple that with the $1k cabinet offer reappearing and I imagine latency could start going the wrong direction as more high-use folks populate cabinets.

Posted by dwessell01, 12-15-2010, 06:48 PM
I live in AVL. I had just left the data center when this happened. I was gone about 15 minutes when I started getting alerts (I've got a full cab there). It's definitely network related. Nasty stuff. I'm losing money and pissing off customers over this. dw

Posted by ssthormess, 12-15-2010, 06:49 PM
Netriplex.com is loading for me again.

Posted by PacketCollision, 12-15-2010, 06:50 PM
I talked to them about 20 minutes ago, and they said that they would be releasing an update in 30 minutes. I asked how they would deliver it, and they said "by email if we can, call back otherwise". The support rep said it was a "major network outage" but could/would not tell me any more. I asked if fiber had been cut, and he said he didn't know, but that again, in 30 minutes they should know more. I'll update here when I get off the phone with them again in 15 minutes or so.

Posted by RyanD, 12-15-2010, 06:50 PM
As a peering partner of Netriplex, we have a link down which is impacting our customers' access to their equipment there. What you are pasting is wildly technically inaccurate and in fact makes no sense. The routing table isn't something you can open up and edit in vi. *EDIT* Looks like it's back now, but with more latency.

Posted by WebManagerNY, 12-15-2010, 06:50 PM
Can you provide details on the $1k cabinet offer?

Posted by Mike12King, 12-15-2010, 06:51 PM
They actually sent out a very detailed report of the incident to us. I'm sure you would have gotten it too.

Posted by dwessell01, 12-15-2010, 06:51 PM
The portal and the website are coming back online. My servers are not, so far. dw

Posted by UH-Bobby, 12-15-2010, 06:53 PM
This is something you may want to contact their sales staff about.

Posted by UH-Bobby, 12-15-2010, 06:55 PM
Netriplex.com is responding, and portal.netriplex.com is also responding. We should be back up here shortly. We're at the 2 hour mark now. Mine went down starting at 3:55PM EST, and it's now 5:55PM EST.

Posted by bhavicp, 12-15-2010, 06:55 PM
Weirdly NetripleX.com is up, however our servers are still down. http://www.netriplex.com/solutions/specials/

Posted by pubcrawler, 12-15-2010, 06:56 PM
http://twitter.com/#!/SalesinCloud - Colocation Special - $999/month: full cab 48U, 20 amps 120V, and 1000 Mbps of unmetered bandwidth. Deal ends Dec. 31, 2010. Contact me (Stuart Dodson) directly.

Posted by cmanns, 12-15-2010, 06:56 PM
Yeah mine went down around 4:10 ish. Still down, same lost route.

Posted by dwessell01, 12-15-2010, 06:57 PM
That's exactly what I'm seeing. They're up, I'm not.. Good for them!

Posted by cyberdog, 12-15-2010, 06:57 PM
Sorry, I didn't mean to make a joke when people are pissed off; it's just what I do to keep my own cool ;-). Anyone who has used vi will know that if you don't know the commands, you can seriously screw up the document you are working on. I realize you can't run vi on Cisco IOS... it was a joke. As to whether a fiber is cut, who knows, but they had a trunk link to ATL down early on in this, and since then routing has just gone haywire. Looks like they got netriplex.com back up 10 minutes ago, though there's nothing educational on their status page:

As of 12/15/2010 5:46:04 PM Eastern, we are experiencing a network issue in our Asheville01 facility. Network engineers are currently working to resolve this issue. Network engineers have isolated the outage to a BGP issue and are diligently working to implement a temporary fix.

Posted by pubcrawler, 12-15-2010, 07:00 PM
Yep, portal and website are back online. Looks like the route might have changed, perhaps onto the other piece of fiber:

5 10gigabitethernet1-1.core1.atl1.he.net (72.52.92.162) 26.607 ms 26.651 ms 26.706 ms
6 198.32.132.91 (198.32.132.91) 26.652 ms 26.664 ms 26.642 ms
7 te-4-0-0.rtr1.avl1.netriplex.com (67.23.161.129) 33.095 ms 33.208 ms 33.116 ms
8 te-5-4.rtr2.avl1.netriplex.com (67.23.161.134) 33.064 ms 41.439 ms 41.634 ms

Still getting a null route to servers and IP space there though. We are at the 2-hour outage point now.

Posted by Glassoholic, 12-15-2010, 07:03 PM
Our VPS with Netriplex is also down. I can now reach the support login screen but not the VPS itself.

Posted by gone-afk, 12-15-2010, 07:04 PM
Outage was noted by our systems at 2:55pm CST

Posted by pubcrawler, 12-15-2010, 07:04 PM
NETRIPLEX Netriplex LLC Netriplex Update: We have found the root cause to be a BGP flaw. The ETA is now 15 minutes or less for a full restoration. 1 minute ago

Posted by cyberdog, 12-15-2010, 07:05 PM
I agree. Technically they have more than 2 links; they have 4 entry points and more than 4 fiber lines, but if you look at their report from before, enough of them were cut that the other links couldn't sustain the traffic. I'd reference the report, but they asked to keep it internal; they did claim they were working to prevent a recurrence. This time around it just looks like a routing cluster you-know-what, so it doesn't really matter if the fiber links are up or not... AND WE'RE BACK!

Posted by prickett233, 12-15-2010, 07:05 PM
Just received alerts that our servers are back online.

Posted by UH-Bobby, 12-15-2010, 07:05 PM
We're showing stuff coming back up now

Posted by Eleven2 Hosting, 12-15-2010, 07:06 PM
Nagios just let me know all is back online for us.

Posted by Postbox, 12-15-2010, 07:07 PM
Same here - We're back up too

Posted by cyberdog, 12-15-2010, 07:07 PM
Yay! Now to find out wtf happened....

Posted by ssthormess, 12-15-2010, 07:08 PM
My sites are back.

Posted by cmanns, 12-15-2010, 07:09 PM
XenServ.net is back up. Ping is higher; normally 2-5ms, now it's 9-15ms. Darn, my Minecraft server's going to lag harder, lame.

Posted by brc_csf, 12-15-2010, 07:09 PM
Still down. Hope we are not the only ones.

Posted by Glassoholic, 12-15-2010, 07:10 PM
We're back online too, thankfully.

Posted by PacketCollision, 12-15-2010, 07:10 PM
Our rack is back up. Hopefully they will give a detailed report about the cause.

Posted by dbatch, 12-15-2010, 07:12 PM
They must be rolling it out rack by rack, we're still down too.

Posted by Postbox, 12-15-2010, 07:14 PM
Someone mentioned power a long way back in the thread. Just to put your minds at rest (for those still down), I just checked a box there...

top - 18:09:33 up 80 days, 23:07

Posted by brc_csf, 12-15-2010, 07:17 PM
Hey. We are still down. If there is someone else still down, please let me know.

Posted by quad3datwork, 12-15-2010, 07:17 PM
Back online!

Posted by dbatch, 12-15-2010, 07:18 PM
still down too... support just told me 20-30% are still down.

Posted by pubcrawler, 12-15-2010, 07:20 PM
We are back up as of 6:07 Eastern Time... 12 minutes ago. Think that's 2 hours and 12 minutes of downtime. Indeed, it seems to be a rolling effect to get back online. Hang tight, I'd say, for the next 30 minutes.

Posted by brc_csf, 12-15-2010, 07:25 PM
OK. Good to know. I just called them and at first the operator told me the issue was resolved. After I gave him a test IP he asked me to wait 5 minutes and try again. Two and a half hours of downtime. Too bad.

Posted by RyanD, 12-15-2010, 07:25 PM
Sounds like some routers blew up and lost their configuration and they are having to manually rebuild them. At least our peering session is live, so our customers can reach their gear there again.

Posted by chmdznr, 12-15-2010, 07:34 PM
Mine is up again. I guess it was a routing problem.

Posted by brc_csf, 12-15-2010, 07:35 PM
not "Was" but "is". Still down here

Posted by dbatch, 12-15-2010, 07:36 PM
You and me both.

Posted by brc_csf, 12-15-2010, 07:44 PM
Called again. They said that they are not 100% restored yet. It would be great if everyone that is back could close their tickets so that they can handle clients that are still down.

Posted by SalesNetriplex, 12-15-2010, 07:49 PM
FYI, some who called the "Netriplex reps" called the sales line and not the tech support line, which was jammed and flooded. The sales reps were not aware of the exact cause of the outage. The focus was to allow the tech team to work uninterrupted to determine the cause. You can check out the Netriplex Twitter account to see updates. From a sales perspective, we will create a much more responsive and more customer-oriented notification venue.

Posted by brc_csf, 12-15-2010, 07:59 PM
3 hours down.

Posted by SalesNetriplex, 12-15-2010, 08:05 PM
I checked for you and was told that if any customer is still not online, please call the NOC. I see that you called, brc_csf, so I know the team is working on your situation.

Posted by brc_csf, 12-15-2010, 08:18 PM
Our second subnet just came back in the last minute. Although this is a really huge downtime (3 hours and 14 minutes), if we don't experience any other outage for the next 10 months this could be seen as an acceptable/isolated incident.
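
For anyone wondering what a 3-hour-14-minute outage does to the "100% SLA" asked about earlier in the thread, the arithmetic is quick. A short Python sketch; the credit schedule at the end is purely hypothetical, since every SLA defines its own terms:

# Back-of-the-envelope uptime math for a 3h14m outage in a 31-day month.
# The credit rule below is purely hypothetical; check your own SLA terms.
import math

outage_minutes = 3 * 60 + 14          # 194 minutes, as reported in this thread
month_minutes = 31 * 24 * 60          # 44,640 minutes in December

uptime_pct = 100 * (1 - outage_minutes / month_minutes)
print(f"Monthly uptime: {uptime_pct:.3f}%")   # about 99.565%, not 100%

# Hypothetical credit schedule: 5% of the monthly fee per started 30 minutes
# of downtime, capped at 100%.
credit_pct = min(100, 5 * math.ceil(outage_minutes / 30))
print(f"Hypothetical SLA credit: {credit_pct}% of the monthly fee")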

Posted by prickett233, 12-16-2010, 04:04 AM
Anyone received an incident report yet for the outage?

Posted by bhavicp, 12-16-2010, 04:05 AM
Yes, titled "Update 2 to Netriplex AVL01 Outage" .

Posted by brc_csf, 12-16-2010, 08:36 AM
Has this incident also affected other Netriplex facilities? We were about to get a new rack and don't want to keep all our eggs in the same basket.

Posted by IIIBradIII, 12-16-2010, 11:15 AM
No, just AVL.

Posted by NodePlex, 12-17-2010, 01:00 PM
Netriplex believes they have determined the root cause of the outage. They are putting measures in place this weekend to prevent it from occurring again.

Posted by pubcrawler, 12-17-2010, 01:12 PM
For the sake of those who are customers and have not received the emails from Netriplex/Uberbandwidth, I will post the three emails they've sent out in a minute.

Posted by gone-afk, 12-17-2010, 01:14 PM
They can contact Netriplex themselves. It is always best not to post specific infrastructure information (model numbers) on the net, especially when they suspect a bug of some sort.

Posted by pubcrawler, 12-17-2010, 01:14 PM
December 15:

As you are aware, we experienced a major network issue this afternoon that started at around 15:45 Eastern with a few BGP flaps, followed by a full-on outage starting at 15:55. While service has been restored, we're not even close to closing out this issue. I am meeting with our network infrastructure team members right now to fully understand what transpired so I can provide you with the full details. We have a case open with Cisco TAC as well to bring additional insight into the events that transpired to cause an outage of this magnitude. Please look for an update from me within the next few hours.

If you are still experiencing issues, please call our NOC at 800-619-8801 option 3 (+1 828 650 8585 internationally). We are diligently working through all of the many tickets at this time.

Sincerely,
Jonathan Hoppe
Chief Technology Officer
Netriplex LLC

Posted by pubcrawler, 12-17-2010, 01:16 PM
Dear Netriplex Customer,

Here is an additional update as of 12:45 AM Eastern on Thursday, December 16th.

Our Sr. network engineers continue to comb through log files and configurations while working with Cisco TAC in an effort to determine the cause of the network outage that just transpired. Sadly, there is no smoking gun. No changes were made to the network at the time of the incident or even before, and no physical hardware or connectivity issues were to blame.

What we are certain of is that the AVL01 edge routers dropped their aggregation route announcement to the core routers, effectively taking all customer prefixes off the Internet. This problem was preceded by a short BGP flap on one of our core routers providing connectivity to three downstream ISP customers. It is our current belief that this flap (and whatever caused it) somehow corrupted the session between the edge and the core in such a way as to prevent it from sending routes even after being manually reset. Because the BGP session between the edge and the core remained online (it was simply sending 0 prefixes), it was initially difficult to determine where the issue was.

Once our engineers found the issue, they spent their time trying to restore the announcements. After an extended period of time, including a full reload of all routers, they started to focus on a workaround which involved removing the announcements from the edge and sending them from the core instead. This, in addition to the implementation of static routes, brought service back online for all customers.

Unfortunately, the true cause is yet to be determined as of this writing. What is certain is that since BGP is not currently implemented between our edge and core at this time, the issue cannot repeat itself right now, which is somewhat of a relief. Only once Cisco and our engineers have determined what happened and how to ensure it can never happen again will we consider implementing BGP between these layers again. Both Cisco TAC and Netriplex engineers believe that a Cisco IOS bug is to blame, but we have yet to confirm that this is definitely the case.

Many customers have written me personally or have posted comments in public forums questioning our redundancy, and for good reason. However, rest assured that AVL01 has no shortage of router and switching redundancy. If you've seen our network diagram, you know our infrastructure is exceedingly robust. Even if we doubled our infrastructure, the issue would have still occurred, since a dropped route announcement would have instantaneously propagated throughout any number of routers. The bigger challenge today is to determine what we can do to prevent this exact scenario from happening again. Once we have accomplished that, we can expand our discussion in an attempt to determine if something slightly different could affect our network in a similarly catastrophic way.

At Netriplex we're very proud of our network (except for today). Over the last two years we have spent all of our profits on improving it because we know that as the Internet becomes more content rich, our customers will appreciate a high performance network. Because of that, we probably have more capacity, redundancy and network providers than other datacenters of similar size. We always learn from our mistakes and not only implement technology to ensure mistakes never repeat themselves, but we implement policies and procedures to refine how we manage change to critical infrastructure as well.

Today, I've heard from customers loud and clear that communication during incidents like these is still insufficient. Despite implementing our Twitter page a couple months ago and increasing the telecom capacity in our NOC, we still failed at keeping our customers sufficiently updated. We will be discussing this very issue at length over the next few days and will share with you our plan to remedy that issue.

As always, we value your business very much, and we do want to hear from you. Feel free to write me personally at jhoppe@netriplex.com with your comments, questions or concerns. The sheer volume of email I expect to receive will certainly delay my response to you, but rest assured your feedback will not disappear into a black hole. Your input is what we use to improve our company. We know an issue like this quickly racks up hundreds of thousands or even millions of dollars in lost business for all, but we will continue to improve our operations and eliminate our weakest links with the ultimate goal of 100% uptime in every facet of our business.

As soon as I have further information to share with you, I will write again.

Sincerely,
Jonathan Hoppe
Chief Technology Officer
Netriplex LLC

Posted by pubcrawler, 12-17-2010, 01:18 PM
Dear Netriplex Customer,

Here is an additional update for Friday, December 17th.

In a few hours, our change management team will be sending out an emergency maintenance notification for this coming Sunday, December 19th, beginning at 1:00 AM Eastern time.

Working with Cisco TAC, our engineers believe we have narrowed down the cause of Wednesday's outage to a known bug affecting the Cisco 6708 module when deployed in 7600 series routers. The 6708 module has 8 x 10 Gigabit ports that we use to cross uplink our edge to our core. All of the 10G connections on one particular module began to cycle up/down sporadically 10 minutes prior to the outage. While this up/down activity did not specifically cause any connectivity issues due to the redundancies in place, we believe that it was the ultimate trigger to the BGP failure between our edge and core which did cause the outage.

We will be using this maintenance window to update to a later IOS version on all devices, add additional 10G modules to distribute the risk over multiple modules, and incorporate use of the 10G ports on the RSPs for additional redundancy using something other than a 6708 module, the only type we have used to date for connecting the edge to the core. As an aside, we use the ES20 module for connectivity from the core to our backbone nodes in Atlanta, GA and Ashburn, VA because it supports VPLS. The ES20 module is not affected by this bug.

As always, we greatly appreciate your patience as we dissected the incident and worked through developing a concrete action plan. Many of you have emailed me personally with suggestions, and I do appreciate it. Keep them coming. I know we have a very long way to go to regain your trust, but that is my personal goal, and everything I will be focusing on between now and January 1st will revolve around that.

I will write you on Monday after the maintenance has been completed to provide you with additional information.

Sincerely,
Jonathan Hoppe
Chief Technology Officer
Netriplex LLC

Posted by gone-afk, 12-17-2010, 01:22 PM
Thanks for posting model numbers on a public forum when the issue is not yet patched!

Posted by brc_csf, 12-17-2010, 01:23 PM
I also think that this info should not be public.

Posted by pubcrawler, 12-17-2010, 01:43 PM
Running an old IOS version? That's Cisco's problem, right? No, it isn't; it's Netriplex's issue. The maintenance should be done during a low-utilization window ASAP; no reason to delay. Upgrading IOS isn't that complex or time-consuming. Is a model number or vendor information a security risk? It shouldn't be, unless the gear isn't redundant and well thought out, or is junky vendor gear. Cisco makes good products that deal with massive attacks and compromises all the time. Saying they use Cisco just tells malicious folks to focus on IOS compromises by default, and this outage wasn't the result of any malicious person, nor does this issue in any way allow such a person to capitalize. Perhaps choosing another router vendor for redundancy purposes is the way to go; all eggs in one vendor basket could lead to exactly the kind of problem described, and thus the outage. It goes back to folks' concerns about redundancy at this company and best practices.

Posted by gone-afk, 12-17-2010, 01:45 PM
Regardless of whose fault it is, it is simply reckless to post such information when it is not necessary and the issue is not resolved. However small the risk, you are placing all customers at further risk. This only reflects poorly on you for not handling the situation with due care.

Posted by pubcrawler, 12-17-2010, 01:53 PM
Reckless is claiming massive redundancy and then getting bitten by a misconfiguration or a documented issue due to failure to take the software upgrade path. Hopefully Netriplex updates all their locations in the next 24 hours and doesn't delay for another 13 or more days and subject customers to another preventable outage. For the record, the explanation emails contained no disclaimer about not being for public consumption. Obviously, there needs to be some auditing process and certification for all these datacenters to monitor infrastructure, firmware, security releases, known exploits, upgrade paths, etc. Leaving this to hired humans at each facility to keep themselves informed often proves to be a poor decision.

Posted by RyanD, 12-17-2010, 04:26 PM
There is no mystery here; this would be known as damping: http://www.faqs.org/rfcs/rfc2439.html. Because they were bouncing like a yo-yo, the routers purposely ignored the routes.
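
RFC 2439 describes the mechanism in terms of a per-route penalty: each flap adds a fixed amount, the penalty decays exponentially with a configured half-life, and the route is suppressed above one threshold and reused once it decays below another. A toy Python sketch of that logic; the numbers are illustrative, not any vendor's defaults:

# Toy model of RFC 2439-style route flap damping: each flap adds a penalty,
# the penalty decays exponentially with a configured half-life, and the route
# is suppressed above one threshold and reused below another.
import math

FLAP_PENALTY = 1000
SUPPRESS_LIMIT = 2000
REUSE_LIMIT = 750
HALF_LIFE_SECONDS = 900  # 15 minutes

class DampedRoute:
    def __init__(self):
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Exponential decay of the accumulated penalty since the last update.
        elapsed = now - self.last_update
        self.penalty *= math.exp(-math.log(2) * elapsed / HALF_LIFE_SECONDS)
        self.last_update = now

    def flap(self, now):
        # Record one withdraw/re-announce cycle at time `now` (seconds).
        self._decay(now)
        self.penalty += FLAP_PENALTY
        if self.penalty >= SUPPRESS_LIMIT:
            self.suppressed = True

    def usable(self, now):
        # A suppressed route becomes usable again once the penalty decays
        # below the reuse limit.
        self._decay(now)
        if self.suppressed and self.penalty < REUSE_LIMIT:
            self.suppressed = False
        return not self.suppressed

route = DampedRoute()
for t in (0, 60, 120):      # three flaps inside two minutes -> suppressed
    route.flap(t)
print(route.usable(130))                          # False: still suppressed
print(route.usable(130 + 3 * HALF_LIFE_SECONDS))  # True after enough decay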

Posted by Ed-Freethought, 12-17-2010, 05:08 PM
It seems from those posts that the bug was not triggered remotely but as a result of a problem with a mix of a certain line card in a certain chassis running a certain IOS version. I'm not sure how you propose that someone is going to maliciously take down the service by knowing what model line card is in use (not that there are a lot of 10Gbps options to choose from that fit in a 6500/7600 chassis). The posts don't give a certain IOS version, but even if they did, you can get that quite accurately from fingerprinting with tools like nmap.

Posted by bhavicp, 12-17-2010, 05:09 PM
I really believe that those emails should have been kept private; they are for customers of NetripleX only.

Posted by NodePlex, 12-17-2010, 05:31 PM
I agree, or if you just have to post them at least get permission from Netriplex. Current customers would have received this information already anyway, so there is not any point in doing so.

Posted by bhavicp, 12-17-2010, 06:09 PM
Just because they're running an old IOS doesn't mean the bug doesn't exist in the newest version (and I'm pretty sure it does). I'm sure if there are any MAJOR updates to the IOS, they will immediately upgrade their routers.

Posted by pubcrawler, 12-17-2010, 09:50 PM
The RFC on BGP flap damping was penned by a fellow from Cisco and dated 1998. BGP flap damping is a semi-regular issue when things go wrong, and the methods, algorithms, etc. in that paper probably existed well before 1998. Far from a new issue, it's as old as BGP itself. I find it hard to believe that Cisco has any current issue regardless of IOS version; there is a reason why they are the major router company globally. I am not saying Netriplex is distorting things either. It seems like the matter might have been above their network team's knowledge base. Perhaps they outsourced the network setup (not uncommon) initially? Maybe they were experiencing high load or a DDoS?

Again, this outage, like the prior one two months ago and the vast majority we read about on this forum, was very preventable. I can only imagine the headaches folks faced with customers and lost revenue. It's a huge issue, especially with this being the Christmas shopping prime time. Of course, this stuff does happen, but this one seems mighty preventable. Someone more versed in the Cisco gear and its built-in features surely is reading this and agreeing.

I want Netriplex to get their redundancy birds in order and be more open about resolving things. You don't get credit for having a great facility when you fall on your face ungracefully like this and juggle odd explanations. Netriplex can resolve this, become a truly redundant, very high reliability company, and receive more business than ever by plowing through this in the correct way. There have been a lot of questions about their business model, actual bandwidth at that facility, their fiber runs, etc. Hopefully, they are forthcoming with information to match their marketing claims.

I'll stick with them, but am shipping more of our gear out to other datacenters at this point. Lucky for me, nothing customer-wise is at their facility, just our own projects, and the major ones are DNS failed over to another datacenter through monitoring.

Cisco IOS docs have a section on BGP flap damping: http://www.cisco.com/en/US/docs/ios/...html#wp1002400
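
The "DNS failed over to another datacenter through monitoring" approach mentioned above boils down to a health check that repoints a low-TTL record when the primary stops answering. A minimal Python sketch, with hypothetical hostnames and the provider-specific DNS update left as a stub:

# Minimal sketch of DNS failover driven by monitoring: probe the primary site,
# and after a few consecutive failures flip traffic to a standby. Hostnames
# are hypothetical, and the actual DNS update is left as a placeholder because
# it depends entirely on your DNS provider's API.
import time
import urllib.request

PRIMARY = "https://www.example.com/health"   # hypothetical health-check URL
FAILURE_THRESHOLD = 3
CHECK_INTERVAL = 60  # seconds

def is_healthy(url, timeout=10):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def point_dns_at_standby():
    # Placeholder: call your DNS provider's API (or nsupdate) here to move
    # the low-TTL record over to the standby datacenter.
    print("FAILOVER: repoint DNS records at the standby site")

failures = 0
while True:
    if is_healthy(PRIMARY):
        failures = 0
    else:
        failures += 1
        if failures >= FAILURE_THRESHOLD:
            point_dns_at_standby()
            break
    time.sleep(CHECK_INTERVAL)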

Posted by shoutcast_server, 12-19-2010, 04:45 AM
2 of my servers at Netriplex are down again (12/19/10 @ 0030 hours PST). Is this just me, or is Netriplex having issues again?

Posted by ishan, 12-19-2010, 04:47 AM
Just you. We are up. This might be related to the maintenance today.

Posted by shoutcast_server, 12-19-2010, 04:49 AM
Never mind, just read an e-mail that they are doing maintenance.

Posted by gone-afk, 12-19-2010, 05:01 AM
All mine went down, then 1 rack/subnet came back up but not the other. I can ping internally, but the subnet is not reachable from outside the DC. So you're not alone. 36 minutes down and counting.

Posted by gone-afk, 12-19-2010, 05:38 AM
Mine's back, 66 mins down. Hopefully you are all back online too. EDIT: Whoops, spoke too soon. Now all my subnets are completely offline. Last edited by gone-afk; 12-19-2010 at 05:43 AM.

Posted by bhavicp, 12-19-2010, 05:39 AM
I've had IPv6 go down for the past 66 minutes. Just seems to have come back up.

Posted by ishan, 12-19-2010, 05:41 AM
We just went down. Back up after 10 minutes. Down again. Last edited by ishan; 12-19-2010 at 05:51 AM.

Posted by cmanns, 12-19-2010, 05:58 AM
It's planned downtime.

Posted by pubcrawler, 12-19-2010, 08:10 AM
Anyone else noticing much higher latency following this maintenance window? Maintenance Time: 01:00 AM to 05:00 AM Eastern Time (06:00 to 10:00 UTC/GMT) Sunday. It's 7AM Eastern now...

From Netriplex to GIPNetworks in Dallas, TX:

2 te-1-8.rtr1.avl1.netriplex.com (67.23.161.133) 0.272 ms 0.330 ms 0.391 ms
3 te-5-4.rtr1.atl1.netriplex.com (67.23.161.130) 6.668 ms 6.720 ms 6.783 ms
4 10gigabitethernet1-1.core1.lon1.he.net (195.66.224.21) 101.245 ms 101.381 ms 101.250 ms
5 10gigabitethernet2-3.core1.nyc4.he.net (72.52.92.77) 172.439 ms 172.431 ms 10gigabitethernet4-4.core1.nyc4.he.net (72.52.92.241) 169.722 ms
6 10gigabitethernet2-3.core1.ash1.he.net (72.52.92.86) 175.841 ms 175.731 ms 175.511 ms
7 10gigabitethernet1-1.core1.dal1.he.net (72.52.92.61) 210.452 ms 210.714 ms 210.823 ms
8 216.66.77.126 (216.66.77.126) 211.757 ms 211.834 ms 211.915 ms
9 po1.dist2.gip.ntxc.net (67.43.48.137) 130.742 ms 130.590 ms 130.791 ms
10 gi0-2.cab603.gip.ntxc.net (66.207.170.14) 133.011 ms 130.806 ms 133.224 ms

The traceroute from GIPNetworks back to Netriplex Asheville has changed also and shows high times, albeit without HE bandwidth:

2 gi9-15.dist2.gip.ntxc.net (66.207.170.13) 0.328 ms 0.385 ms 0.445 ms
3 67.210.231.130 (67.210.231.130) 0.317 ms 0.351 ms 0.405 ms
4 dfw-edge1-eqx.peer.gipnetworks.com (209.107.197.181) 1.214 ms 1.208 ms 1.273 ms
5 te3-4.bbr1.ash1.bandcon.com (216.151.179.218) 36.039 ms 36.035 ms 36.215 ms
6 209.107.197.210 (209.107.197.210) 130.445 ms 130.312 ms 130.339 ms
7 te-5-5.rtr2.avl1.netriplex.com (67.23.161.254) 130.731 ms 130.990 ms 131.084 ms

These routes have changed. I forget whom the traffic was flowing through before, but I do remember it going to Atlanta and out from there; I don't think HE was in the mix. Can someone confirm this? I know from GIPNetworks to Netriplex in Asheville and vice versa it was around 50ms and several hops fewer prior to this maintenance window.

Posted by pubcrawler, 12-19-2010, 08:50 AM
From Hurricane's network in Ashburn, VA to Netriplex in Asheville:

1 7 ms 16 ms 8 ms 10gigabitethernet1-2.core1.nyc4.he.net (72.52.92.85)
2 74 ms 74 ms 74 ms 10gigabitethernet3-3.core1.lon1.he.net (72.52.92.242)
3 170 ms 173 ms 175 ms rtr1.lon1.netriplex.com (195.66.225.82)
4 179 ms 176 ms 176 ms te-4-0-0.rtr1.avl1.netriplex.com (67.23.161.129)
5 176 ms 203 ms 177 ms te-5-5.rtr2.avl1.netriplex.com (67.23.161.254)

Latency starts at 195.66.225.82, which is:

LINX Brocade LAN 36167 195.66.225.82/23 1000 (1 GBIT) London Internet Exchange

Wasn't traffic prior to now often going through Equinix in Ashburn? PeeringDB has the Equinix Ashburn entry for Netriplex as:

Equinix Ashburn 36167 pending... 10000

Dreading the latency when people wake up today and saturate whatever clog there now seems to be.

Posted by pubcrawler, 12-19-2010, 04:35 PM
(BUMP) The routing to Hurricane Electric-centric locations is still showing a goofy path and high times. I submitted a ticket to Netriplex 7 hours ago and they haven't said a word.

From Netriplex Asheville to GIPNetworks, Dallas:

2 te-1-8.rtr1.avl1.netriplex.com (67.23.161.133) 0.271 ms 0.312 ms 0.369 ms
3 te-5-4.rtr1.atl1.netriplex.com (67.23.161.130) 6.645 ms 6.691 ms 6.757 ms
4 10gigabitethernet1-1.core1.lon1.he.net (195.66.224.21) 101.224 ms 101.232 ms 106.516 ms
5 10gigabitethernet2-3.core1.nyc4.he.net (72.52.92.77) 169.615 ms 169.285 ms 10gigabitethernet4-4.core1.nyc4.he.net (72.52.92.241) 169.712 ms
6 10gigabitethernet2-3.core1.ash1.he.net (72.52.92.86) 185.453 ms 175.822 ms 175.684 ms
7 10gigabitethernet1-1.core1.dal1.he.net (72.52.92.61) 210.479 ms 210.712 ms 210.696 ms
8 216.66.77.126 (216.66.77.126) 211.006 ms 211.233 ms 211.142 ms
9 po1.dist2.gip.ntxc.net (67.43.48.137) 130.464 ms 130.633 ms 130.553 ms

To Wholesaleinternet in Missouri:

2 te-1-8.rtr1.avl1.netriplex.com (67.23.161.133) 0.243 ms 0.298 ms 0.378 ms
3 te-5-4.rtr1.atl1.netriplex.com (67.23.161.130) 6.688 ms 6.734 ms 6.797 ms
4 10gigabitethernet1-1.core1.lon1.he.net (195.66.224.21) 101.177 ms 101.258 ms 101.406 ms
5 10gigabitethernet4-4.core1.nyc4.he.net (72.52.92.241) 172.504 ms 172.497 ms 172.437 ms
6 10gigabitethernet1-2.core1.chi1.he.net (72.52.92.102) 199.850 ms 199.761 ms 191.222 ms
7 10gigabitethernet1-1.core1.mci1.he.net (72.52.92.2) 202.141 ms 201.921 ms 202.028 ms
8 10gigabitethernet1-1.core1.mci2.he.net (184.105.213.2) 202.166 ms 202.010 ms 202.218 ms
9 wholesale-internet-inc.10gigabitethernet1-4.core1.mci2.he.net (216.66.79.10) 202.699 ms 202.338 ms 202.463 ms
10 69.30.209.3 (69.30.209.3) 203.158 ms 204.791 ms 204.150 ms

To IOFLOOD.com - Phoenix:

2 te-1-8.rtr1.avl1.netriplex.com (67.23.161.133) 0.266 ms 0.302 ms 0.364 ms
3 te-5-4.rtr1.atl1.netriplex.com (67.23.161.130) 6.651 ms 6.694 ms 6.751 ms
4 10gigabitethernet1-1.core1.lon1.he.net (195.66.224.21) 107.007 ms 106.929 ms 111.680 ms
5 10gigabitethernet4-4.core1.nyc4.he.net (72.52.92.241) 169.640 ms 10gigabitethernet2-3.core1.nyc4.he.net (72.52.92.77) 173.513 ms 173.639 ms
6 10gigabitethernet2-3.core1.ash1.he.net (72.52.92.86) 180.857 ms 176.538 ms 176.480 ms
7 10gigabitethernet1-1.core1.dal1.he.net (72.52.92.61) 211.784 ms 211.585 ms 211.900 ms
8 10gigabitethernet1-2.core1.phx1.he.net (72.52.92.253) 246.418 ms 246.377 ms 246.400 ms
9 10gigabitethernet2-1.core1.phx2.he.net (184.105.213.18) 239.243 ms 238.945 ms 239.074 ms
10 gateway.ioflood.com (64.71.145.46) 240.154 ms 241.972 ms 240.739 ms
11 ioflood.com (184.105.134.14) 239.184 ms 239.275 ms 238.735 ms

I am perplexed. Hurricane has peering at Telx in Atlanta and so does Netriplex. You would think that handoff would likely happen and HE would take the short route through their own network to their customers. This appears to be an HE network issue; however, HE wasn't mucking with their network, Netriplex was.
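
Spot checks like these are easy to automate: ping each far-end host a handful of times and flag anything far above its usual round-trip time. A small Python sketch, assuming a Linux-style ping -c; the target list and baseline numbers are examples drawn loosely from the traces in this thread:

# Small helper to automate latency spot checks: ping each far-end host a few
# times and flag anything well above its usual round-trip time. Assumes a
# Linux-style `ping -c`; targets and baselines below are examples only.
import re
import subprocess

# host -> rough "normal" RTT in ms (example values, not measured baselines)
BASELINES = {
    "67.43.48.137": 55,     # GIPNetworks, Dallas
    "216.156.108.58": 45,   # XO hand-off toward Asheville
}

def avg_rtt_ms(host, count=5):
    out = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True, timeout=60,
    ).stdout
    times = [float(m) for m in re.findall(r"time=([\d.]+)", out)]
    return sum(times) / len(times) if times else None

for host, baseline in BASELINES.items():
    rtt = avg_rtt_ms(host)
    if rtt is None:
        print(f"{host}: no replies")
    elif rtt > 2 * baseline:
        print(f"{host}: {rtt:.1f} ms (baseline ~{baseline} ms), check routing")
    else:
        print(f"{host}: {rtt:.1f} ms, looks normal")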

Posted by ishan, 12-19-2010, 04:50 PM
Yes, it is going all the way to London and then back to NY. Other routes are fine. I am also seeing Level3 and XO, which is a relief after seeing just Bandcon and Comcast for the past few months.

Posted by (Stephen), 12-19-2010, 04:57 PM
I just did a trace to one of the IPs here in post 136, and it has some bad latency happening between Atlanta and them, not sure what exactly, and this is on XO, not HE.

7 34 ms 34 ms 34 ms vb1310.rar3.dallas-tx.us.xo.net [216.156.0.105]
8 35 ms 34 ms 35 ms te-4-0-0.rar3.atlanta-ga.us.xo.net [207.88.12.1]
9 34 ms 34 ms 34 ms ae0d1.cir1.atlanta6-ga.us.xo.net [207.88.13.157]
10 124 ms 123 ms 123 ms 216.156.108.58.ptr.us.xo.net [216.156.108.58]
11 130 ms 130 ms 130 ms te-4-0-0.rtr1.avl1.netriplex.com [67.23.161.129]
12 130 ms 130 ms 130 ms te-5-5.rtr2.avl1.netriplex.com [67.23.161.254]

34-45ms would be normal to that area; 130 is insanely high. Defender Technologies in Ashburn, VA is 41ms. Last edited by (Stephen); 12-19-2010 at 05:01 PM.

Posted by pubcrawler, 12-19-2010, 05:07 PM
I have a ticket in at Netriplex from 9 hours ago. I just emailed support at HE to see if they are aware why traffic destined for their customers is taking the long trip abroad to London. Almost certain there is other traffic destined for other networks that is seeing equally high latency and taking what appear to be clogged routes. 130ms to Atlanta is outrageous; Asheville to Seattle shouldn't even come in that high.

Posted by pubcrawler, 12-19-2010, 05:29 PM
This is a traceroute from the London Internet Exchange's looking glass to Netriplex, Asheville:

1 rtr1.lon1.netriplex.com (195.66.225.82) 100 msec 100 msec 92 msec
2 te-4-0-0.rtr1.avl1.netriplex.com (67.23.161.129) [AS 36167] 100 msec 104 msec 100 msec
3 te-5-5.rtr2.avl1.netriplex.com (67.23.161.254) [AS 36167] 104 msec 100 msec 104 msec

Posted by Postbox, 12-19-2010, 07:08 PM
A trace to HE at Fremont does the same:

2 te-1-8.rtr1.avl1.netriplex.com (67.23.161.133) 0.570 ms 0.580 ms 0.579 ms
3 te-5-4.rtr1.atl1.netriplex.com (67.23.161.130) 7.040 ms 7.043 ms 7.045 ms
4 10gigabitethernet1-1.core1.lon1.he.net (195.66.224.21) 110.958 ms 111.202 ms 111.206 ms
5 10gigabitethernet4-4.core1.nyc4.he.net (72.52.92.241) 169.391 ms 169.636 ms 169.638 ms
6 10gigabitethernet3-1.core1.sjc2.he.net (72.52.92.25) 248.576 ms 259.049 ms 259.047 ms
7 10gigabitethernet1-1.core1.fmt1.he.net (72.52.92.109) 245.785 ms 244.812 ms 245.043 ms

That's just nuts.

Posted by RyanD, 12-19-2010, 07:10 PM
Yeah, something is really wrong on their network; we're not getting anything on our peering session with them. It also looks like they moved their site off-network: depending on the source IP of the request, I'm seeing www.netriplex.com in NYC or Toronto. They are not advertising any routes to us over our peering session, and it's making it a real pain to access any of our customers' gear located in their facility, with everything routing to London first.

Posted by Postbox, 12-19-2010, 07:24 PM
You want a really crazy one? To GNAX Atlanta...

2 te-1-8.rtr1.avl1.netriplex.com (67.23.161.133) 0.380 ms 0.617 ms 0.627 ms
3 te-5-4.rtr1.atl1.netriplex.com (67.23.161.130) 6.839 ms 6.875 ms 7.073 ms
4 linx.ge1-0.cr01.lhr01.mzima.net (195.66.225.15) 113.251 ms 113.256 ms 113.257 ms
5 te1-3.cr1.was2.us.packetexchange.net (69.174.120.97) 196.934 ms 197.001 ms 197.001 ms
6 te1-3.cr1.atl1.us.packetexchange.net (69.174.120.53) 203.156 ms 202.393 ms 202.373 ms
7 xe0-gnax.cust.atl01.mzima.net (67.199.136.154) 103.352 ms 103.366 ms 103.126 ms

Where LHR = London Heathrow.

Posted by pubcrawler, 12-19-2010, 08:39 PM
Yeppers, hop #4 is another London Internet Exchange IP, so it's clear it's going to the UK and back. I don't get these routes being so screwed up. I see Telx peering in Atlanta for Netriplex, and I swear they had the Ashburn, VA Equinix in the mix prior to today. Not a word from Netriplex on any of this... it's been 14 hours... It will be interesting to see how these loops to the UK and back perform under tomorrow's workday load.

Posted by Postbox, 12-19-2010, 08:50 PM
I'm not waiting to find out. Colleagues thought I was over-reacting by switching to failover in Washington on Saturday, but with the flaps on the network during the Sunday maintenance and the crazy routing and latency since, I'm glad I insisted! I take it you've still not had any word back from Netriplex?

Posted by pubcrawler, 12-19-2010, 09:18 PM
Last activity on my ticket at Netriplex was at 8:13 AM, nearly 12 hours ago. Nothing from them. Nothing from HE either. I imagine HE will see their UK-bound links carrying irregular traffic come tomorrow and finally look into the matter. Rather disappointed, but not surprised. For a company (Netriplex) that has four of their own data centers and a pile of partner facilities, who has been at this since 1999, and has rather pricey offerings aside from Uberbandwidth, well, I just expect more. "Every facility offers unsurpassed connectivity, reliability and redundancy to customers requiring unprecedented global coverage, failover and geographic traffic routing capabilities." The routing to the UK and back sure must have something to do with that global coverage and geographic traffic routing.

Posted by pubcrawler, 12-19-2010, 09:23 PM
I am going to email Jonathan Hoppe, Netriplex's Technical contact directly about this issue. I recommend anyone else who has an outstanding ticket or current routing issue to do the same. Send me a private message for his email.

Posted by Postbox, 12-19-2010, 09:28 PM
Indeed it is most Netriperplexing

Posted by pubcrawler, 12-19-2010, 09:49 PM
I received an email back from Jonathan Hoppe in a whopping 6 minutes. I owe him a beer for that, especially on a Sunday evening. "Thanks... you're the second one to create a ticket, but we found that HE is having peering issues in the US at the moment, or at least accepting our prefixes, I can't speak for other customers of theirs. We'll turn down peering to them in London to keep it within the country. My team should have this complete within the hour." I noted the similar Mzima routing path to the UK and back and am awaiting his response on their remedy for that.

Posted by Postbox, 12-19-2010, 09:59 PM
Nice one! I'll try some more traces when I'm back (in 6 hours) - Just in time for the UK/EU rush

Posted by pubcrawler, 12-19-2010, 10:10 PM
Jonathan Hoppe said they'll resolve the Mzima long trip to the UK tonight also. I have my fingers crossed that they are able to clean these routes up quickly. I am almost certain traffic for us was often going out and up to the Ashburn, VA Equinix facility. Wondering if that peering is currently down or the BGP entries for it are gone? Equinix's mix is simply magic and really improves most things, in my somewhat limited experience. If anyone else notices strange routes out of Netriplex Asheville via carriers other than HE or Mzima, post traceroutes here and I'll forward them so Netriplex can work on those routes too.

Posted by RyanD, 12-19-2010, 11:05 PM
They simply are not announcing routes, or their links are down; they are not announcing any routes directly to us on the AIX exchange, so it is not an HE problem. We do not have their routes dampened, so this is something on their side.

Posted by pubcrawler, 12-19-2010, 11:17 PM
I just received a response from HE directly: "We advertise the same prefixes to them in all common locations. As it is, I'm getting no replies from their exchange address in Atlanta, which would explain why no traffic is going through there. I tried clearing the BGP session from here, with no change." That confirms what RyanD said.

Posted by pubcrawler, 12-19-2010, 11:24 PM
Yep, indeed Netriplex is down at the AIX/Telx Atlanta exchange, per https://tie.telx.com/participants/ :

Netriplex LLC 36167 198.32.132.91 Down 2001:478:132::91 Down Open

Similarly, their connectivity to the Dallas Telx internet exchange appears to be down also (same page):

Netriplex LLC 36167 206.126.114.16 2001:504:17:114::16 Open

Posted by garysimat, 12-19-2010, 11:34 PM
198.32.132.91 4 36167 327652 300504 0 0 0 19:07:09 Active

We peer with them on the Atlanta exchange and the session has been down for at least 19 hours.

Posted by pubcrawler, 12-20-2010, 03:21 AM
The Atlanta Exchange link is still down, and traffic is still routing to the UK and back. Guess I am setting my clock to get up early and redirect our traffic away from Netriplex. So much for a quick resolution. Hopefully, when I wake up, some miracle will have occurred and things will be back to normal.

I did some looking into one glaring way this routing is impacting us. We have an hourly job that copies databases and core assets, creates a tar/gzip archive and ships a good-sized file from Netriplex to Dallas. Normally we get an email about the job being completed at about 24 minutes after the hour. Now, due to latency issues, it's 33 minutes after the hour, another 9 minutes. Can't wait to see this job likely start failing as the latency goes through the roof to the UK. Thanks to everyone who helped on this thread from a more knowledgeable perspective inside the network in Atlanta.
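
Rather than inferring the slowdown from when the completion email lands, an hourly job like that can time itself and complain when the transfer runs long. A Python sketch with placeholder paths, destination host, and threshold:

# Sketch of wrapping an hourly tar-and-ship job so the transfer duration is
# logged explicitly instead of inferred from when the completion email lands.
# Paths, destination host, and threshold below are placeholders.
import subprocess
import time

ARCHIVE = "/tmp/hourly-backup.tar.gz"
SOURCE_DIR = "/var/backups/hourly"            # hypothetical source directory
DEST = "backup@dallas.example.com:/backups/"  # hypothetical Dallas target
WARN_AFTER_SECONDS = 10 * 60                  # normally finishes well under this

start = time.time()
subprocess.run(["tar", "-czf", ARCHIVE, SOURCE_DIR], check=True)
subprocess.run(["scp", ARCHIVE, DEST], check=True)
elapsed = time.time() - start

print(f"backup shipped in {elapsed / 60:.1f} minutes")
if elapsed > WARN_AFTER_SECONDS:
    # Hook in whatever alerting you already use (email, Nagios, etc.).
    print("WARNING: transfer took unusually long, possible routing/latency issue")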

Posted by Postbox, 12-20-2010, 06:30 AM
It's no different this morning (UK time). Still routing from Atlanta to Atlanta via UK:

2 te-1-8.rtr1.avl1.netriplex.com (67.23.161.133) 0.582 ms 0.597 ms 0.597 ms
3 te-5-4.rtr1.atl1.netriplex.com (67.23.161.130) 6.805 ms 7.050 ms 7.053 ms
4 linx.ge1-0.cr01.lhr01.mzima.net (195.66.225.15) 109.723 ms 109.724 ms 109.726 ms
5 te1-3.cr1.was2.us.packetexchange.net (69.174.120.97) 190.152 ms 190.384 ms 190.386 ms
6 te1-3.cr1.atl1.us.packetexchange.net (69.174.120.53) 202.609 ms 215.311 ms 214.811 ms
7 xe0-gnax.cust.atl01.mzima.net (67.199.136.154) 103.851 ms 103.852 ms 104.055 ms

The other traces I did last night are just the same. Thank YOU for raising this - I doubt I would have checked until I saw your posts.

Posted by pubcrawler, 12-20-2010, 08:56 AM
Morning everyone! They are still routing through the UK. So much for a quick miracle resolution.

Posted by pubcrawler, 12-20-2010, 09:46 AM
Noticed that the route between Netriplex in Asheville and GIPNetworks via HE's pipe just changed:

1 66.219.24.81 (66.219.24.81) 0.337 ms 0.395 ms 0.454 ms
2 te-1-8.rtr1.avl1.netriplex.com (67.23.161.133) 0.267 ms 0.312 ms 0.375 ms
3 209.107.197.209 (209.107.197.209) 14.606 ms 14.666 ms 14.739 ms
4 216.151.179.142 (216.151.179.142) 14.666 ms 14.718 ms 14.766 ms
5 10gigabitethernet2-2.core1.ash1.he.net (206.223.115.37) 16.036 ms 15.583 ms 15.773 ms
6 10gigabitethernet1-1.core1.dal1.he.net (72.52.92.61) 53.067 ms 57.862 ms 58.029 ms
7 216.66.77.126 (216.66.77.126) 51.003 ms 50.536 ms 50.789 ms
8 po1.dist2.gip.ntxc.net (67.43.48.137) 50.587 ms 50.438 ms 50.496 ms
9 gi0-2.cab603.gip.ntxc.net (66.207.170.14) 52.685 ms 50.435 ms 52.900 ms

Still mucked up for Mzima/Highwinds-bound traffic though, it seems. Our CDN is being pushed all the way over to the Netherlands...

Posted by pubcrawler, 12-20-2010, 09:48 AM
To GNAX in Atlanta it's still going the steamship route to the UK. Hopefully they get that route cleaned up this morning. Better late than never.

Posted by Postbox, 12-20-2010, 10:14 AM
Morning/afternoon. I was about to post that inbound routes seemed to be a lot better... until... Chicago, anyone?

2 64.79.96.193.rdns.continuumdatacenters.com (64.79.96.193) 6.904 ms 7.041 ms 7.075 ms
3 xe1-0.cr01.ord01.mzima.net (72.37.148.137) 2.545 ms 2.555 ms 2.554 ms
4 te4-5.cr1.nyc1.us.packetexchange.net (69.174.120.73) 49.139 ms 49.168 ms 49.164 ms
5 te0-5.cr1.lon1.uk.packetexchange.net (69.174.120.90) 115.821 ms 116.027 ms 116.019 ms
6 rtr1.lon1.netriplex.com (195.66.225.82) 114.890 ms 114.723 ms 114.753 ms
7 te-4-0-0.rtr1.avl1.netriplex.com (67.23.161.129) 120.789 ms 120.856 ms 120.920 ms
8 te-5-5.rtr2.avl1.netriplex.com (67.23.161.254) 121.151 ms 121.221 ms 121.379 ms

Hmmmm.

Posted by pubcrawler, 12-20-2010, 10:37 AM
Ouch! That's a long trip to get to/from Chicago. Mzima/Highwinds peering is obviously still broken at Netriplex. Hoping other folks test the networks they are on and see what, if any, other issues are out there.

Posted by RyanD, 12-20-2010, 10:47 AM
Looks like they restored the Ashburn bandcon link.

Posted by pubcrawler, 12-20-2010, 11:57 AM
Route to our CDN provider seems back in order now too:

traceroute to cdn.pubcrawler.com (67.201.31.32), 30 hops max, 60 byte packets
1 66.219.24.81 (66.219.24.81) 0.284 ms 0.340 ms 0.403 ms
2 te-1-8.rtr1.avl1.netriplex.com (67.23.161.133) 0.254 ms 0.306 ms 0.370 ms
3 te-5-4.rtr1.atl1.netriplex.com (67.23.161.130) 6.645 ms 6.710 ms 6.768 ms
4 te-4-2.car1.Atlanta4.Level3.net (4.53.233.37) 6.690 ms 6.749 ms 6.803 ms
5 ae-28-52.car2.Atlanta4.Level3.net (4.69.150.72) 6.808 ms 6.868 ms 6.933 ms
6 MZIMA-NETWO.car2.Atlanta4.Level3.net (4.53.234.6) 7.055 ms 14.008 ms 14.055 ms
7 67.201.31.32 (67.201.31.32) 6.764 ms 6.765 ms 6.750 ms

Huge improvement!

Posted by pubcrawler, 12-20-2010, 11:59 AM
Route from Asheville to Continuum looks better now:

traceroute to 64.79.96.193 (64.79.96.193), 30 hops max, 60 byte packets
1 66.219.24.81 (66.219.24.81) 0.577 ms 0.900 ms 0.893 ms
2 te-1-8.rtr1.avl1.netriplex.com (67.23.161.133) 72.603 ms 72.668 ms 72.719 ms
3 te-5-4.rtr1.atl1.netriplex.com (67.23.161.130) 6.667 ms 6.736 ms 6.786 ms
4 64.209.108.77 (64.209.108.77) 6.697 ms 6.703 ms 6.747 ms
5 xe-5-0-1.ar1.ord1.us.nlayer.net (69.31.110.229) 50.422 ms 50.154 ms 50.119 ms
6 64.79.96.193.rdns.continuumdatacenters.com (64.79.96.193) 55.173 ms 53.297 ms 50.176 ms

Posted by pubcrawler, 12-21-2010, 06:16 PM
Found another route that is taking the long trip to the UK. This time it involves traffic destined for Limelight's customers:

1 66.219.24.81 (66.219.24.81) 0.350 ms 0.395 ms 0.451 ms
2 te-1-8.rtr1.avl1.netriplex.com (67.23.161.133) 0.266 ms 0.319 ms 0.377 ms
3 te-5-4.rtr1.atl1.netriplex.com (67.23.161.130) 6.664 ms 6.709 ms 6.768 ms
4 tge1-4.fr3.lon.llnw.net (195.66.224.133) 100.991 ms 107.510 ms 107.609 ms
5 tge7-2.fr3.lga.llnw.net (69.28.171.125) 96.866 ms 96.945 ms 97.042 ms
6 tge13-1.fr3.ord.llnw.net (68.142.125.45) 123.657 ms tge8-4.fr3.ord.llnw.net (69.28.171.193) 126.839 ms 126.754 ms
7 tge2-1.fr3.sjc.llnw.net (69.28.171.66) 164.169 ms tge13-3.fr3.sjc.llnw.net (69.28.189.21) 158.639 ms tge2-1.fr3.sjc.llnw.net (69.28.171.66) 164.263 ms
8 tge12-1.fr3.lax.llnw.net (69.28.172.53) 156.245 ms tge14-4.fr3.lax.llnw.net (69.28.189.9) 157.149 ms tge12-1.fr3.lax.llnw.net (69.28.172.53) 156.039 ms
9 68.142.106.78 (68.142.106.78) 156.213 ms 156.329 ms 156.099 ms
10 br01-1-2.lax4.net2ez.com (64.93.64.162) 157.136 ms 156.601 ms 156.790 ms
11 cr02-1-2.lax4.net2ez.com (64.93.64.78) 158.973 ms 159.219 ms 158.203 ms
12 mt-cr02.mediatemple.net (64.93.75.18) 158.950 ms 159.062 ms 159.400 ms
13 72.10.63.198 (72.10.63.198) 163.112 ms 162.962 ms 163.129 ms
14 opsdv02.mediatemple.net (64.207.129.58) 158.508 ms 159.050 ms 159.014 ms

I've filed a ticket and notified the folks at Netriplex about this. I encourage folks to look at routes between that network and their other locations to make sure other routing isn't poor like this.


