Inforelay Down for +4 hours - Los Angeles - Knowledgebase

Inforelay Down for +4 hours - Los Angeles

Posted by HSN-Saman, 06-09-2013, 10:28 PM
Hello,

As of this writing, Inforelay's network in LA has been down for over 4 hours. They have two redundant cores in Equinix LA1, and we're fed from both cores for extra redundancy.

We opened a ticket as soon as our monitoring system alerted us. At first they told us it was only our problem and that the entire network was up, but after 2 hours they said it was a bigger, more serious issue affecting all customers on their LA network.

Also, two days ago we noticed that our throughput through Inforelay to *anywhere* was really low and unacceptable, at least 20x lower than it had been before or than it is on our other links. We opened a ticket reporting the issue with logs and a clear explanation, and spent two days convincing them that it was their problem, not their upstreams'. LOL, they told me that they charge for support on weekends: their problem, you pay.

No ETA has been given so far. They just told us that high CPU utilization on their cores is causing the issue. The cores are not responding to ICMP requests, and the VLAN is down.

Is there anyone here using Inforelay's Network in Los Angeles? Let me know.
My email address is saman-at-hugeserver-dot-com

Thanks.
Posted by JustinAY, 06-09-2013, 10:55 PM
InfoRelay has some staff on here, I believe. Hopefully someone can ping this thread and let you know what's going on.
Posted by HSN-Saman, 06-09-2013, 11:33 PM

Quote:

Originally Posted by JustinAY

InfoRelay has some staff on here, I believe. Hopefully someone can ping this thread and let you know what's going on.

If they knew how to solve it, it would be solved by now. When I spoke to their manager, all I heard was "my apologies", etc.
6 hours and counting ...
Posted by HSN-Saman, 06-10-2013, 01:16 AM
LOL, their core routers are not responding to pings and they said:

Quote:

can you confirm that your end is physically connected?

Really poor disaster recovery management... No one is answering the phones at the moment, but they confirmed it's their fault and that a spanning tree configuration issue caused a loop. Before that, they reported high CPU utilization, DDoS, etc. They're still not sure enough about the issue to give an ETA.

8 hours of downtime is not something serious for them yet ;-)
Posted by Katie, 06-10-2013, 02:55 AM
Ouch! I hope you update when they/you come back online.
Posted by HSN-Saman, 06-10-2013, 03:08 AM

Quote:

Originally Posted by Katie

Ouch! I hope you update when they/you come back online.

Approaching 10 hours of downtime. Something strange is now showing up in our MTRs after their core router: "mailtest.100mbpsservers.com". There are still routing issues on their side, and they are still trying to hide the exact issue and look innocent!
Posted by HSN-Saman, 06-10-2013, 03:19 AM
Finally, after 10 hours of sending emails and making calls, the BGP session came back up a few minutes ago.
Posted by Preetam, 06-10-2013, 07:06 AM
Any updates from them about what happened?
Posted by HSN-Saman, 06-10-2013, 02:04 PM

Quote:

Originally Posted by Preetam

Any updates from them about what happened?

Nothing with details yet. Their president told me that they will release the RFO for this downtime ASAP.
He also stated: "We are also upgrading both routers in Los Angeles to the latest Juniper hardware within the coming months, which will also help to protect us from issues like this."
Posted by HD-Sam, 06-14-2013, 01:18 AM
Saman, is this at 900 N Alameda? Did you hear back from the president?
Posted by Jeremy, 06-14-2013, 02:21 AM
I would have gone down there if you'd liked and kicked 'em in the you-know-what!

=p
Posted by HSN-Saman, 06-17-2013, 09:35 AM

Quote:

Originally Posted by HD-Sam

Saman, is this at 900 N Alameda? Did you hear back from the president?

They have 3 or 4 locations in LA, all connected to a single core router in their main POP in LA, and they run two routers in Equinix that we are connected to. But when the core router itself has no redundancy, having redundant aggregation cores doesn't help much.

The RFO has been released on their website. It claims that the source of the issue was Cisco firmware and IPv6 causing high CPU utilization on both of their cores!

Also, the lack of out-of-band remote management and the wait for Equinix remote hands service on a weekend extended the downtime.

The president told me that they will switch to Juniper MXs in 3-6 months to avoid such problems. They also applied a portion of our service fee as a credit to our account.

https://support.inforelay.com/index....nloaditemid=31
Posted by hostvirtual, 06-17-2013, 10:12 AM
We're an InfoRelay customer in multiple locations including Los Angeles. HSN-Saman, that doesn't completely reflect the RFO or their setup in Los Angeles.

Although we are very unhappy and disappointed with the outage, having worked with them for over 10 years we're confident they are taking the steps necessary to prevent similar issues again.
Posted by Rusty500, 06-19-2013, 06:18 PM
HostVirtual, thanks for sharing your thoughts.

We just became aware of this thread.

Saman, as HostVirtual has indicated, the information that you provided is not 100% accurate.

To clarify, Saman's company had an issue with certain routes being slow, which affected some of their customers. This problem was traced to an upstream network provider. Unfortunately the time to resolution on this was not optimal, because we continued to ask for iperf and MTR data, often with no response. Many of the replies that we received from your side were threatening, including one in which you threatened to post on WebHostingTalk if we did not get the issue resolved promptly -- and this was in response to us asking you for more information. Our technicians responded promptly to your requests, typically within minutes of each submission. Threatening to post publicly is not a means to get an issue resolved.

We then learned that there was an issue with your mail server and that you were not receiving our responses -- this is one of the reasons why resolution took so long for your initial issue. I'm sure that this contributed to your frustrations; perhaps you did not believe that we were responding to your requests.

The slowness issue that you were experiencing was unrelated to the larger issue, though it began earlier in the weekend, before the larger issue occurred.

Regarding charging you for support, we only charge for weekend support for network engineering if the problem is found to be outside of our network (i.e. if the problem is outside of our control).

While the slowness issue that you experienced may have gone on for quite a bit of time (over 10 hours), that was in part due to the lack of responsiveness from your side due to your e-mail server being down. The major issue that affected most of the LA customers was significantly shorter than this, and is outlined in the RFO.

The issue that prompted this event in LA was related to a facility cross-connect provided by one of our three data centers in Los Angeles. While the RFO explained all relevant details, analysis of RSPAN data indicated that an infrastructure connection between two floors of the building had third party switch gear between our equipment; something that we were told was specifically not the case earlier on. In short, a layer 2 loop was formed, which caused major network issues.

As you are aware, we promptly issued the RFO and provided service credits to affected customers, including your company. We also explained major steps that we will be taking to ensure that nothing like this happens again. As we have made clear, this is the most major network issue that we've seen in our 18 years of operation. We take this very seriously, and in addition to some modifications to our procedures, we will be doing a complete refresh of our equipment in Los Angeles in the coming months.

While we are regretful that there was an issue, we are doing everything in our power to ensure that nothing like this can happen again. If you wish to continue our discussion, I welcome the ability to do that privately with you. I do not believe that a public forum is the appropriate place to discuss these matters.

If there are any questions or concerns, I recommend that you contact me privately. I have already communicated with you directly, and you're welcome to get back to me if you'd like to discuss further.
Posted by HSN-Saman, 06-19-2013, 08:18 PM
Hello,

Thank you for your post.

1- What I have written here is 100% accurate, and we have nothing to keep private.

2- The issue you are describing, low network performance, was related to your routes. After you removed the ACLs and other configuration from our uplink, the issue was solved, but only after 3 days.

3- There was nothing wrong on our side. We always replied to your tickets through your portal while we had the issue receiving emails from you. So that was not what caused the delays; the delays came from your technical staff.

4- You became aware of your network outage once we emailed you, and only after 3 hours did you send people to the data center to check the issue.

Let me know if I am wrong on the details.

Posted by belia, 06-19-2013, 11:24 PM
You're trying to smear or 'flame' your host it looks like.

If 'INFO' asks you for MTR and iperf data then you have to provide that information if you complain about internet issues. Instead it looks like you just threatened to smear them on webhostingtalk.

If I had a client threaten me like that I'd kick them out. Simply put. It's clear that you're trying to smear them a little bit here.

I believe that not everyone was down for 10 hours as you made it look. Do I believe YOU were down for 10 hours? Yes, I believe that.

You should realize that smearing your own host isn't going to help HugeServer.com sell servers. Clients who host their servers with HSN-Saman or HugeServer.com will just think that maybe they should move them directly to INFO.
Posted by HSN-Saman, 06-20-2013, 04:03 AM
Hi,

Interesting that you registered today and posted on this thread against us.

1- All the information was sent to them promptly and correctly, but they were not able to handle it and solve the issue in a timely manner.

2- Everyone on their LAX network was down for 10 hours, and that is stated in their RFO.

Have a nice time

Posted by HSN-Saman, 06-29-2013, 04:30 AM
So,

They have kept us down for another 3 hours so far. And the same sh** answers on the phone and in emails: "We do not have any ETA for how soon it will be solved."

Really funny...
Posted by hostvirtual, 06-29-2013, 10:29 AM
... As part of the scheduled maintenance they are working on in response to the original issue you posted about.

Quote:

We are writing to inform you of maintenance at the InfoRelay LAX1, LAX2 and LAX3 facilities, in Los Angeles, CA. This maintenance will be performed on customer-facing switching and routing gear during the following window:

Friday, June 28th 21:00 PDT to Saturday, June 29th 05:00 PDT

During this window, InfoRelay network engineers will be installing and configuring new gear in response to and to prevent a recurrence of last month's outage.

No interruption in service is anticipated during this window. Due to the nature of this work, however, a service interruption is possible.

That said, we haven't seen an issue from our monitoring of LAX1 or LAX3. Isn't this why you multi home?
Posted by Rusty500, 06-29-2013, 02:21 PM
Saman,

As HostVirtual indicated, this was a maintenance window announced approximately 2 weeks in advance. As you're also aware, the work was performed after business hours.

We have provided you with a service credit for the issue and have explained measures that we will be taking to ensure that this will not happen again.

We do not monitor WebHostingTalk regularly, so the best thing you can do to get our attention is to contact us as opposed to posting on this forum.

We have many large clients in the Los Angeles area, including Fortune 500 companies, and while we have provided service credits to each of them, none of them has demonstrated the lack of understanding or professionalism that you are demonstrating here. Frankly I do not think your customers will be impressed with your language or your general demeanor here.

While we are confident that we will not see issues such as these in the future, if you would like to pick up a cross-connect to a second provider for redundancy, we would be glad to assist you with this.
Posted by HSN-Saman, 07-02-2013, 06:52 AM

Hi,

@hostvirtual: No, the issue was on their LAX1 network, and they said so in their tickets and phone calls. I understand the nature of such updates. They planned their work between 9pm and 5am. We work with international clients, who may be located in Europe, Asia, or anywhere else. Consider that while Inforelay was working on their network it was peak time in the EU, and your clients' servers were unreachable. What can your clients do? Lose money.

Anyway, they were down again today for about 40 minutes. Was that planned work again?

@Rusty500: The service credit you provided does nothing for us. You cannot buy your clients with service credits. We need quality and UPTIME, and that is why we pay you guys for redundant uplinks.

The relationship between us and our clients is better than the one between you and your clients, because we know how important it is for our clients to be online, and we do everything to keep them online and not down.