

Colo4 Service Entrance 2 issue




Posted by pvanmeter, 08-10-2011, 01:19 PM
There has been an issue affecting one of our 6 service entrances. The actual ATS (Automatic Transfer Switch) is having an issue and all vendors are on site. Unfortunately, this is affecting service entrance 2 in the 3000 Irving facility, so it is affecting a lot of the customers that have been here the longest. The other entrance in 3000 is still up and working fine, as are the 4 entrances in 3004. Customers utilizing A/B should have access to their secondary link. It does appear that some customers were affected by a switch that had a failure in 3000. That has been addressed and should be up now. This is not related to the PDU maintenance we had in 3004 last night; separate building, service entrance, UPS, PDU, etc. We will be updating customers as we get information from our vendors so that they know the estimated time for the outage. Once this has been resolved we will also distribute a detailed RFO to those affected. Our electrical contractors, UPS maintenance team and generator contractor are all on-site and working to determine the best course of action to get this back up.

Posted by VPSForge-Ray, 08-10-2011, 01:21 PM
Finally! Thanks for informing us of the problem.

Posted by xnpu, 08-10-2011, 01:27 PM
Thanks for the update. Hope this won't be an all-nighter for me. (1.30am here)

Posted by sosys, 08-10-2011, 01:28 PM
After 1 hour and 30 minutes, I can see this update. Can you update via Twitter?

Posted by Eid Adam, 08-10-2011, 01:30 PM
Hope this will not take longer than 20 mins.

Posted by Shwan, 08-10-2011, 01:31 PM
Thanks for the update. Please keep us posted.

Posted by wap35, 08-10-2011, 01:33 PM
Honestly, this is just not acceptable. A lot of your customers will be losing their customers.
- Whatever happened to proper redundancy and failover?
- How can you possibly compensate your customers for losing their customers?
- I am going to ask my provider why they go with you when you don't have a proper plan in place.
Last edited by wap35; 08-10-2011 at 01:45 PM.

Posted by zdom, 08-10-2011, 01:36 PM
Can you please let us know why there is apparently only one of this critical component? Redundancy is the most important issue in data centers...

Posted by XTremo, 08-10-2011, 01:37 PM
I'd be interested to know that as well.

Posted by elmister, 08-10-2011, 01:38 PM
People seem to be posting complaints here: http://maps.google.com/maps/place?cl...=0CCAQnwIoATAA. Too many complaints in too little time.

Posted by Patrick, 08-10-2011, 01:40 PM
Those people are morons. I think this is awful, I really do, not the outage, but how some people are freaking out and throwing Colo4 under the bus. If this happened every year, I'd understand... but this is a very, very freak sort of thing that no one is immune from. Google, Amazon, companies with budgets 1000x larger than Colo4 suffer outages of varying degrees, even power outages... there is nothing that is guaranteed in this world.

Posted by Formas, 08-10-2011, 01:41 PM
Sure, redundancy is completely necessary. How can we trust that tomorrow there will be no problem with other equipment that has no redundancy? And what about an ETA to fix this issue?

Posted by gbcactus, 08-10-2011, 01:43 PM
This is not acceptable. They should have backup systems. I thought that was standard practice for hosts and data centers. Does anyone have a recommendation for a hosting reseller that's more dependable?

Posted by elmister, 08-10-2011, 01:43 PM
Paul, we've got customers on the phone complaining; people are mostly asking for an ETA and we have nothing to tell them. Could you give us an estimate?

Posted by rds100, 08-10-2011, 01:43 PM
But as far as I understand, there IS redundancy. Customers who chose to be connected to both power feed A and power feed B are not affected. Those who chose to be connected to only one power feed (i.e. chose no redundancy) are affected.

Posted by Stockli, 08-10-2011, 01:47 PM
The lack of communication is the issue for me. An hour and a half of downtime before there's any publicly available info about what's going on is unacceptable. They don't have a status reporting system (that I can find). They don't use Twitter. I understand that "stuff happens", but the most important part of keeping trust in the relationship is good communication, and they get a "fail" on that part.

Posted by zdom, 08-10-2011, 01:47 PM
I will ask our hoster about this... No chance to switch everything from A to B?

Posted by sosys, 08-10-2011, 01:49 PM
been 2 hours and counting..

Posted by Formas, 08-10-2011, 01:52 PM
Strange, because if I have redundancy then obviously my site does too. But I cannot access http://www.colo4.com/

Posted by troboy, 08-10-2011, 01:52 PM
It's 11:30 PM in Mumbai. I hope you will be up in a couple of hours. This is not happening for the first time; this is the second downtime event in 2011. Colo4 should maintain a backup power plan.

Posted by pvanmeter, 08-10-2011, 01:55 PM
Updates here will be slow, as our techs are working on affected customers. Redundancy is in place. We have 2 100% diverse power plants. With the PDU issue the other day and now an ATS issue, having A/B power would have avoided any power issue. As stated, equipment of any kind can fail no matter how redundant you make it. Each entrance has its own generator, service entrance/transformers, UPS, PDU and even a 30 hour fuel supply, just to avoid having single points of failure. I don't have an ETA on this plant returning. The contractors are going through the systems now to determine what caused the outage. This must be done before restoring a 2000 amp ATS (that's 2000 amps at 480 volts). This is absolutely the first time we have had a service entrance failure, and even with an entire entrance going down, customers using A/B were not affected. The website issue just shows that sometimes even with redundancy people make a mistake. It was on 2 power circuits to two UPSs, but the same entrance. Our guys were focused on customers and didn't even catch that.

Posted by rds100, 08-10-2011, 01:55 PM
Not necessarily. Only if you use a server with 2 redundant power supplies; those usually cost more. And as stated above, even when you have two power supplies there are ways to plug everything in without redundancy.

Posted by zdom, 08-10-2011, 02:00 PM
Sorry guys, but it's unacceptable to be down for this long.

Posted by neumannu47, 08-10-2011, 02:06 PM
What up time were you promised when you signed up?

Posted by wynnyelle, 08-10-2011, 02:06 PM
Bet this wasn't in the contract when people signed up. That this could happen? Unacceptable.

Posted by layer0, 08-10-2011, 02:10 PM
Downtime can happen in any facility. If this is such a big problem for you, then you should look into geographic redundancy for your infrastructure.

Posted by nightmarecolo4, 08-10-2011, 02:10 PM
ATSs rarely fail. If one did, that is a single point of failure and you should have a redundant ATS. Either way you should be on generator or utility now through a redundant ATS. Your UPS should have taken the glitch during switching. Please, can you provide an ETA?

Posted by zdom, 08-10-2011, 02:10 PM
Our hoster has 100% uptime guarantee...

Posted by user5151, 08-10-2011, 02:10 PM
Paul - your A/B comment is off the mark, I believe... Tracert to our network inside of colo4 shows it dying at the HSRP router...

  1     2 ms     2 ms     2 ms  10.35.0.1
  2     3 ms     1 ms     2 ms  66.78.210.193
  3     3 ms     2 ms     1 ms  66.78.210.242
  4    10 ms     6 ms     4 ms  209.242.143.97
  5     5 ms    25 ms     4 ms  68.6.8.164
  6     5 ms     4 ms     5 ms  68.6.8.14
  7    17 ms    55 ms    19 ms  68.1.0.235
  8    19 ms    17 ms    18 ms  64.125.31.70
  9    18 ms    16 ms    17 ms  64.125.26.29
 10    48 ms    51 ms    48 ms  64.125.25.17
 11    96 ms    97 ms    75 ms  64.125.31.121
 12    49 ms   103 ms    49 ms  64.125.27.82
 13    49 ms    49 ms    49 ms  64.124.196.226
 14     *        *        *     Request timed out.
 15     *        *        *     Request timed out.

64.124.196.226 is Abovenet... not colo4. Based on my notes, 65.99.244.225 is the colo4 HSRP hop... tracert is dying BEFORE your HSRP hop. So, why isn't this up? Is the HSRP router not on A/B power circuits either??? We're using A/B power circuits for our equipment, the major servers, but they still aren't reachable because this HSRP device isn't online yet. There's been no answer from your support team on this.

Posted by Patrick, 08-10-2011, 02:12 PM
Uptime guarantees only mean that in the event of an outage, usually there is some form of compensation if the provider fails to meet that guarantee. It does not mean they will never go down.

Posted by xnpu, 08-10-2011, 02:12 PM
Then you should complain to him, not here. Also note that most 100% uptime guarantees are not really guarantees. They promise you (usually uselessly low) compensation if the 100% is not reached. Contact your hoster and claim your $1.

Posted by KevinJ5, 08-10-2011, 02:12 PM
One of my servers still pings, but with higher than normal latency:

| 66.109.6.52    - 0 | 61 | 61 |  14 |  23 |  50 |  21 |
| 66.109.6.209   - 0 | 61 | 61 |  15 |  20 |  45 |  20 |
| 4.59.32.17     - 0 | 61 | 61 |  17 |  32 |  81 |  57 |
| 4.69.145.200   - 0 | 61 | 61 |  16 |  26 |  83 |  21 |
| 8.9.232.74     - 4 | 58 | 56 | 444 | 476 | 614 | 468 |
| 206.123.64.130 - 0 | 61 | 61 |  14 |  20 |  40 |  14 |

WinMTR. Last edited by KevinJ5; 08-10-2011 at 02:18 PM.

Posted by wynnyelle, 08-10-2011, 02:13 PM
No answers and no estimate as to when it will be up again. Fail. I'm losing money and user trust as we speak.

Posted by divva, 08-10-2011, 02:16 PM
It's the radio silence that's inexcusable. When clients are calling - we need to be able to tell them something instead of "duh - I don't know." People get a lot less pissed when they hear the "why" behind the outage. They're placated even more when they learn their data is safe - which we still don't know with this outage.

Posted by neumannu47, 08-10-2011, 02:18 PM
Is your hoster Colo4 or a reseller of Colo4's services?

Posted by UH-Matt, 08-10-2011, 02:18 PM
You have a 100% SLA on power in your contract with Colo4, so you will be able to submit a claim if you have lost power.

Posted by sosys, 08-10-2011, 02:18 PM
Is it running now? colo4.com is up.

Posted by wynnyelle, 08-10-2011, 02:20 PM
They apparently don't see fit to reassure all the sites and businesses that they are ruining with this.

Posted by VN-Ken, 08-10-2011, 02:23 PM
Have you ever thought that maybe they have their manpower tied up in trying to get your customers back online? They have provided an explanation - obviously details won't be provided until everything is back up. Be reasonable, please.

Posted by pvanmeter, 08-10-2011, 02:23 PM
That traceroute is showing that Abovenet is down, not our network, because we have multiple providers. There was one switch that did have a failure here also, which was rectified. Was that trace from earlier today or is it just hitting Abovenet? We do have multiple ATSs on-site and live. Their failure is very rare, but that is why we build facilities that can offer 2N power, just in case. Customers that use A/B are still up.

Posted by wynnyelle, 08-10-2011, 02:24 PM
It's hard to rationalise not even having a Twitter account. You can set one up for free in 5 minutes.

Posted by pvanmeter, 08-10-2011, 02:27 PM
And a site update: the site was brought to a crawl by the volume of requests. It will be replaced with a status update page.

Posted by wynnyelle, 08-10-2011, 02:28 PM
Is the data safe?

Posted by divva, 08-10-2011, 02:28 PM
No one has said "give us "X" amount of time to get this back online." That's what most of us need right now.

Posted by TheWiseOne, 08-10-2011, 02:29 PM
Yes, most customers are looking for an ETA. It can't be impossible to provide a ballpark figure.

Posted by wynnyelle, 08-10-2011, 02:31 PM
That is really all I have been asking for from the start. A ballpark figure. Why is it so hard?

Posted by voipcarrier, 08-10-2011, 02:34 PM
Our guy on-site says the ATS might have fused. Minimum ETA is two hours.

Posted by KevinJ5, 08-10-2011, 02:34 PM
The tracert was from today, right now. I am still getting higher than normal latency on that one hop, but service does seem to be fine. I wasn't aware power failure redundancy had to be purchased.

| 24.164.211.241                        - 0 | 18 | 18 |  18 |  21 |  48 |  21 |
| 24.164.209.112                        - 0 | 18 | 18 |  15 |  25 |  49 |  19 |
| 70.125.216.108                        - 0 | 18 | 18 |  18 |  29 |  49 |  28 |
| te0-8-0-2.dllatxl3-cr01.texas.rr.com  - 0 | 18 | 18 |  22 |  31 |  51 |  27 |
| ae-8-0.cr0.dfw10.tbone.rr.com         - 0 | 18 | 18 |  15 |  26 |  53 |  15 |
| ae-3-0.pr0.dfw10.tbone.rr.com         - 0 | 18 | 18 |  19 |  24 |  49 |  20 |
| xe-11-1-1.edge9.Dallas1.Level3.net    - 0 | 18 | 18 |  27 |  32 |  50 |  29 |
| ae-4-90.edge3.Dallas1.Level3.net      - 0 | 18 | 18 |  22 |  31 |  74 |  22 |
| COLO4-DALLA.edge3.Dallas1.Level3.net  - 0 | 18 | 18 | 428 | 462 | 861 | 453 |
| 206.123.64.130                        - 0 | 18 | 18 |  20 |  28 |  50 |  20 |

Posted by TMS - JoseQ, 08-10-2011, 02:35 PM
ETA is typically impossible to produce until the last minute. In cases like this, you don't know how long it will take to fix until you have the solution, and service is typically restored soon after that. Any ETA provided is just a guess, bound to be picked up by Murphy's Laws to immediately make things worse. All we can do is trust that they will fix it as soon as they can. ETA is ASAP.

Posted by gamemaster, 08-10-2011, 02:36 PM
Two hours from now? Different time zones here so you are saying 4:30 EST / 8:30GMT?

Posted by Formas, 08-10-2011, 02:36 PM
Yes, I agree. We need an ETA. My conclusion is: if no one says, it's because no one knows. So today will be a long day down.

Posted by solidhostingph, 08-10-2011, 02:39 PM
Let's just wait rather than complain. I think they know what they're doing. I just hope it will be less than 2 hours. It's 2:38 AM here.

Posted by voipcarrier, 08-10-2011, 02:39 PM
Yeah, two hours from now (1:39PM CST). This is coming from information my network engineer who is on-site could glean. Apparently it's a mad house over there right now. EDIT: Again, this was given to me as a best-case scenario and is likely a shot in the dark.

Posted by wynnyelle, 08-10-2011, 02:39 PM
I still can't get anything to load.

Posted by user5151, 08-10-2011, 02:41 PM
Paul, my tracert was from a few minutes before my post, and it was to your HSRP device at 65.99.244.225. It looks like you've switched over to Level3... but we're still seeing the same thing... response stops at your HSRP router...

C:\Windows\System32>tracert 65.99.244.225

Tracing route to 65.99.244.225 over a maximum of 30 hops

  1    <1 ms    <1 ms    <1 ms  STORMCLOUD [192.168.150.1]
  2     6 ms     7 ms     5 ms  10.244.64.1
  3   133 ms     6 ms     7 ms  gig3-0-3.gnboncsg-rtr2.triad.rr.com [24.28.229.161]
  4    11 ms    10 ms     9 ms  xe-5-1-0.chlrncpop-rtr1.southeast.rr.com [24.93.64.92]
  5    19 ms    27 ms    15 ms  107.14.19.18
  6    15 ms    14 ms    15 ms  ae-0-0.pr0.atl20.tbone.rr.com [66.109.6.171]
  7    29 ms    27 ms    27 ms  xe-10-3-0.edge4.Atlanta2.Level3.net [4.59.12.21]
  8    29 ms    28 ms    28 ms  vlan52.ebr2.Atlanta2.Level3.net [4.69.150.126]
  9    27 ms    28 ms    28 ms  ae-73-73.ebr3.Atlanta2.Level3.net [4.69.148.253]
 10    35 ms    35 ms    34 ms  ae-7-7.ebr3.Dallas1.Level3.net [4.69.134.21]
 11    36 ms    44 ms    35 ms  ae-83-83.csw3.Dallas1.Level3.net [4.69.151.157]
 12    34 ms    33 ms    33 ms  ae-3-80.edge3.Dallas1.Level3.net [4.69.145.136]
 13     *        *        *     Request timed out.
 14     *        *        *     Request timed out.
 15     *        *        *     Request timed out.
 16     *        *        *     Request timed out.
 17     *        *        *     Request timed out.
 18     *        *        *     Request timed out.
 19

Posted by nightmarecolo4, 08-10-2011, 02:42 PM
This has happened to me before at a datacenter for a large communication center whose data centers I managed. Temporarily run lines from Generator to UPS line and power it up that way for now bypassing ATS. Fix ATS and put it back online during an outage window with planned downtime. Maybe that would take 2 hours to achieve also. What a mess-

Posted by jumpoint, 08-10-2011, 02:43 PM
This may be a dumb question, but if you were to simply go and swap your cables from the A to the B side, would that work? Or is that something that would have to have been pre-configured?

Posted by layer0, 08-10-2011, 02:44 PM
There likely is not enough capacity on the "B side" to supply power to all units currently offline.

Posted by Eleven2 Hosting, 08-10-2011, 02:45 PM
Mr. VanMeter, Can you provide us with an update?

Posted by jumpoint, 08-10-2011, 02:46 PM
Yes, you're most likely right. So I guess that Colo4's stance about all of us having A/B is a bunch of BS too.

Posted by wynnyelle, 08-10-2011, 02:46 PM
I have no website and..nothing. Hours now. Unacceptable.

Posted by UH-Matt, 08-10-2011, 02:50 PM
It is a shame for us as we have an original old rack in the old facility with only a handful of servers left in it. We have a *new* rack in the new "unaffected" facility, and over 30 servers in this rack are down because apparently the network (which c4d setup) is a cross connect to our old rack, rather than its own network. So we have 30 servers in the unaffected building down and nothing that can be done which sucks quite a lot.

Posted by SomeGuyTryingToHelp, 08-10-2011, 02:51 PM
It has been determined that the ATS will need repairs that will take time to perform. Fortunately Colo4 has another ATS that is on-site that can be used as a spare. Contractors are working on a solution right now that will allow us to safely bring that ATS in and use it as a spare while that repair is happening. That plan is being developed now and we should have an update soon as to the time frame to restore temporary power. We will need to schedule another window when the temp ATS is brought offline and replaced by the repaired ATS.

Posted by arthur8, 08-10-2011, 02:52 PM
My entire business is offline. ETA?

Posted by Dedicatedone, 08-10-2011, 02:52 PM
Are they not SAS 70 Type II audited? Isn't that certification meant for scenarios like this one?

Posted by screwednow, 08-10-2011, 02:52 PM
I am bombarded by emails from paying customers unable to access our service. WTF is going on and when will it be back up? I can't treat customers the way you're treating us by saying I don't know when the server will be back up. I need an estimated time frame. NOW.

Posted by nightmarecolo4, 08-10-2011, 02:54 PM
I have just lost my biggest client

Posted by wynnyelle, 08-10-2011, 02:57 PM
I'm not sticking with this place, that's for sure. Already spoke to my web host. He's leaving, too.

Posted by andryus, 08-10-2011, 02:57 PM
Still offline.

Posted by tchaffin, 08-10-2011, 02:59 PM
SAS 70 Type II simply certifies that the organization adheres to their written policies and procedures and that an auditor has verified they are properly implemented. The cert doesn't apply to physical, mechanical equipment and its predisposition to potential failures or malfunctions. -Tom

Posted by Spudstr, 08-10-2011, 03:00 PM
ATS switches are always seen as a SPOF (single point of failure).

Posted by DirkM, 08-10-2011, 03:00 PM
The earlier comment that there were no previous outages is incorrect, as we had a similar "network blip" just 9 days ago; I just checked my records. I understand everyone's concerns, as I have already been contacted by several of my clients, and much like you all, I have nothing to really tell them. Is this possibly a result of Dallas having its 39th day at 100 degrees Fahrenheit? Really would like to be back up and running. Dirk

Posted by EGXHosting, 08-10-2011, 03:00 PM
The day after I opened a hosting account at colo4dallas is when they go down... Even though I am a reseller at another host, it is still affecting me.

Posted by Ed-Freethought, 08-10-2011, 03:00 PM
The audit would likely take into account that fully diverse A+B power is available to customers that opt to pay for it.

Posted by nightmarecolo4, 08-10-2011, 03:01 PM
Simply bypass the ATS and power the UPS with generator power (re-fuel the gen as needed until a good solution is in place). Join the lines up inside the ATS. Should have been done already - it's going to take a long time to wire in another ATS.

Posted by neumannu47, 08-10-2011, 03:02 PM
Just curious. How much money per month was your biggest client paying for hosting?

Posted by teh_lorax, 08-10-2011, 03:02 PM
Do you people who are screaming for an ETA really not understand how this stuff works? Really? You're IT people, right?

Posted by Spudstr, 08-10-2011, 03:02 PM
That's not going to happen; the whole point of the ATS is to transfer the power load between the generator pathway and utility. You just can't bypass it without taking everything back down again. Of course... unless you are a tier 4 datacenter with 2n+1/2.

Posted by DONNER, 08-10-2011, 03:03 PM
Site's been down all day... This is unacceptable!!!! Thousands of dollars have been lost by many of us I'm sure.

Posted by DomineauX, 08-10-2011, 03:04 PM
Yeah it is hard to understand how UPS/generators haven't kicked in for this situation.

Posted by nightmarecolo4, 08-10-2011, 03:04 PM
$500 per month on a dedicated server to them that we managed

Posted by wynnyelle, 08-10-2011, 03:04 PM
I'm not really smart with all this stuff but...yeah...I was thinking all this time, don't they at least have a generator as backup?

Posted by CentralMass, 08-10-2011, 03:05 PM
I am questioning their tagline of "Unrivaled Support"!!!! The lack of an ETA makes me question that.

Posted by screwednow, 08-10-2011, 03:07 PM
I don't remember being offered any A/B anything by my hosting company rimuhosting, or else I would've opted for it had I known it would prevent this 3+ hours of downtime. We need this problem fixed 3 hours ago.

Posted by soniceffect, 08-10-2011, 03:07 PM
As a customer of one of the webhosts on here, I can understand your frustration. I have called my host about 10 times trying to get information and sent emails. However, in the same respect, as has been said above, you can't always give a timeframe because sometimes you just don't know.

Posted by Ed-Freethought, 08-10-2011, 03:08 PM
Assuming that the A and B sides have the same capacity (why would you pay to install extra capacity on the B side until there is customer demand for it, you are just incurring CapEx by buying expensive equipment and OpEx by lowering your efficiency) then you would still need to either run new whips from each rack to the B-side PDUs or re-wire the A-side PDUs into the B-side bus. Neither of those is a quick or easy solution.

Posted by Michaelz, 08-10-2011, 03:09 PM
Quite epic now, and frustrating.

Posted by soniceffect, 08-10-2011, 03:10 PM
Epic ..... Now thats a description LOL

Posted by wynnyelle, 08-10-2011, 03:10 PM
I already threatened to quit my webhost for this. He's said he's quitting colo4 now though so we shall see.

Posted by kemuel, 08-10-2011, 03:14 PM
This is the real world, people. When non-standard things happen, you try your hardest and you're done when it's done. If you can't estimate how long it's going to take, you can't. If they underestimate it, everyone will be complaining that they were lied to and how unjust that is. If they overestimate, everyone will complain about how long it is still going to take from the start. It being unacceptable and such does not change the reality of the matter. They have quite a bit of redundancy in place, but in the end you cannot protect yourself from everything. And yes, my company is affected too; taking my anger out here won't bring it back. As for all the 'Why have the backup generators not kicked in???' posts, reading the entire thread may be more advisable than flooding it with the same questions over and over. In short, these things happen; deal with it and rely on your datacenter, or switch occupations.

Posted by Ed-Freethought, 08-10-2011, 03:15 PM
The ATS (Automatic Transfer Switch) connects the UPS to the mains/utility and the generator. Its entire purpose is to transfer the incoming supply to the UPS between the mains and the generator safely and automatically. With the ATS non-functional, the UPS is isolated from both the mains/utility and the generator supply.

Posted by icoso, 08-10-2011, 03:16 PM
Try telling your customers this (the truth): There has been a wide-spread power outage in the Dallas-Fort Worth area that is currently affecting hundreds of customers, including our colocation facility. The facility has backup/redundant power generators, which are operational. However, that power outage caused some issues with one of the ATSs (Automatic Transfer Switches) that provide the backup power to the facility. This is a wide-spread power outage problem that is not only affecting the server colocation facility that our servers reside in, but undoubtedly other businesses, sub-stations, telecom COs, etc. in the Dallas/Ft. Worth area, and that is affecting access to the server facility. Currently, there are about 500 customers in the DFW area that are completely without power. (Check the Oncor website at: http://www.oncor.com/community/outages/#) This most likely was caused by the heat-related rolling blackouts that area has been experiencing recently. Right now the facility maintenance, vendors, and power company are all working on the problems that are affecting this entire area.

Posted by teh_lorax, 08-10-2011, 03:16 PM
You people aren't thinking logically with this frenzy that you're in. Colo4 isn't to blame. If your server is down, it's because your host (or YOU) didn't opt for A/B power. It's an option for... well... when things like this happen!

Posted by soniceffect, 08-10-2011, 03:16 PM
There's that damn karma button or +1 or like when ya need it? LOL

Posted by tmax100, 08-10-2011, 03:17 PM
Can you please tell me what happened to the backup power system? Clients want an explanation.

Posted by jumpoint, 08-10-2011, 03:17 PM
When you are paying for consumer-level internet at $50/mo, "just deal with it" is an acceptable answer when you have an outage. When you pay a spectacular sum of money for a professional co-location facility that prides itself on uptime, "just deal with it" ISN'T ACCEPTABLE ON ANY LEVEL.

Posted by teh_lorax, 08-10-2011, 03:20 PM
And you could have that uptime, IF you opted for A/B power. It's pretty simple.

Posted by soniceffect, 08-10-2011, 03:20 PM
Neither is 100°F+ weather for over a month, but you can't do much about that either.

Posted by wynnyelle, 08-10-2011, 03:21 PM
I am paying...A LOT more than $50 a month let's put it that way. So one power source goes down and we're screwed?

Posted by jumpoint, 08-10-2011, 03:21 PM
Perhaps I don't completely understand, but shouldn't there have been a second backup ATS?

Posted by kemuel, 08-10-2011, 03:21 PM
We were told what happened, we know they are working on it. They cannot do more. Talking in caps and acting wronged may make you feel better but it will not actually change matters. So yes, you must sometimes accept things and deal with it as best as you can, just like them. How much you pay doesn't really change things. This does not exactly happen every week.

Posted by neumannu47, 08-10-2011, 03:24 PM
Ouch. That hurts.

Posted by jumpoint, 08-10-2011, 03:24 PM
If you aren't going to be reliable enough to supply the A line, then A/B should be standard on every account. 100% uptime is supposed to be able to happen, according to Colo4's own white paper, which I'm posting below in a separate post.

Posted by wbcustomer, 08-10-2011, 03:24 PM
What I love about all of the "don't they have redundancy" comments is this: Don't YOU have any redundancy for YOUR customers? Let’s face it, you get paid to provide a service to your clients and you have obviously offered no redundancy. Perhaps you are just angry that you got caught with your pants down. I too have a bunch of sites down. And this follows directly after another data center in Florida went offline due to a Cogent issue this morning. So today, ALL of my client sites either are, or have been down. I’m in the same boat you are in and yes, we are all losing money. However, the buck has to stop somewhere and I say it stops with you (the whiners). Put your money where your big mouths are and make sure ALL of your clients have their sites in multiple data centers around the world…..or shut up. And to the folks at Colo4 who are trying to fix this, Thank you!

Posted by wynnyelle, 08-10-2011, 03:25 PM
Why isn't it mirrored for disaster recovery? You know, since this is a disaster.

Posted by screwednow, 08-10-2011, 03:25 PM
Agreed 100%

Posted by nightmarecolo4, 08-10-2011, 03:26 PM
Normally if the power actually goes down like this for any reason (generator or utility or both), the UPS inverter plant will take the load. The UPS is really just designed to take glitches, but depending on how many battery cabinets there are and how much load is on the unit, a UPS could run for an hour. My worry is that if the UPS ran down all the batteries and shut down, there could be a secondary problem during re-power and re-charge mode. I know you don't want to hear that. Hope it will be OK. Thanks for posting about the utility outage; makes more sense now... Thanks Colo4 for doing your best. This kind of thing happens...

Posted by chum3728, 08-10-2011, 03:27 PM
There's no way to have redundant ATSs on a single power feed. 100% uptime is guaranteed with TWO power feeds, so if you have a single feed, you're at the whim of a single point of failure.

Posted by Ryan G - Limestone, 08-10-2011, 03:28 PM
As far as I know there were no large scale rolling blackouts in DFW and if so they were supposed to be limited to residential areas.

Posted by teh_lorax, 08-10-2011, 03:28 PM
And you would still have 100% uptime if you had A/B power. It's not like this happens all the time. In fact, hold on to your chair for this one, even if you had A/B power, there is STILL a chance that something could go wrong with both plants at the same time. The chances are extremely remote, but still there. If that were to happen, would you still be crying? At what point do you go, OK this is really out of bounds of anything resembling normal operation?

Posted by wynnyelle, 08-10-2011, 03:29 PM
They may sure be doing their best to fix it right now but some kind of weakness in how they set themselves up to begin with caused this to happen. They're just scrambling to dig themselves out of a hole they dug themselves. From what I can tell, it isn't going too quickly.

Posted by MetworkEnterprises, 08-10-2011, 03:29 PM
1. The single worst customer service stance one can take is, "I know we massively screwed up on service you pay for and we are unable to provide. Now, if you had opted for a more expensive service level, we could have provided the service we promised you." While this may be true, you just don't say that in the middle of an epic screw up.

2. On Twitter you can find several companies with the A/B power option who also happen to be down. Additionally, that Colo4's own site went down for a while indicates that they, also, either don't use the A/B power option or they do and it didn't work. Either way isn't good for them.

3. While I get the sentiment that they aren't doing this to us on purpose, that things happen, whatever excuse you want, it is mind-boggling to me that they don't have someone standing by giving more frequent updates to their customers via web, Twitter, whatever. Clearly, they must have quite a staff standing by with not a lot to do at the moment until the vendors and contractors get things up and running.

Posted by neumannu47, 08-10-2011, 03:30 PM
If lightning hits the electrical service panel on the side of your house, do you have a second one? The ATS is a serious piece of equipment. It sounds like there might have been some serious fireworks with some contact welding, but that's just a guess. Our business is down, too, but sometimes when you buy a suit with two pairs of pants, you burn a hole in the coat.

Posted by Ed-Freethought, 08-10-2011, 03:30 PM
There is, that's what the B feed is for. The A and B feeds are completely independent of each other. If one fails then the other should still be fully operational, as it is in this case.

Posted by nightmarecolo4, 08-10-2011, 03:33 PM
I expect A/B are 2 separate redundant feeds in the same data center. Just guessing of course. That means if the whole datacenter is down like it is both will not work. ATS is a single point of failure in this case by the look of it.

Posted by jumpoint, 08-10-2011, 03:35 PM
My point exactly.

Posted by Ed-Freethought, 08-10-2011, 03:35 PM
You simply can't provide complete redundancy on a single power line, there always has to be a device somewhere that is a single point of failure. The laws of physics demand it. If you could provide full redundancy on a single feed, there would be no reason to have dual feeds.

Posted by pvanmeter, 08-10-2011, 03:35 PM
Just wanted to give a quick update that was just given to the customers on site: We have determined that the repairs for the ATS will take more time than anticipated, so we are putting into service a backup ATS that we have on-site. We are working with our power team to safely bring the replacement ATS into operation. We will update you as soon as we have an estimated time that the replacement ATS will be online. Later, once we have repaired the main ATS, we will schedule an update window to transition from the temporary power solution. We will provide advance notice and timelines to minimize any disruption to your business. Again, we apologize for the loss of connectivity and impact to your business. We are working diligently to get things back online for our customers.

Posted by Formas, 08-10-2011, 03:35 PM
"A/B power" justification seems to be nor completely true. I see in other thread that some clients with A/B power still are down.

Posted by boskone, 08-10-2011, 03:37 PM
We, for one, have full A/B - we've been down in dallas the whole time. Suggesting having A/B will 'fix' this is ignorance of the facts.

Posted by Ed-Freethought, 08-10-2011, 03:38 PM
Is the power out on both the A side and the B side, or is there a network issue because a device somewhere upstream of you only sources power from one of the independent feeds?

Posted by Tobarja, 08-10-2011, 03:39 PM
Some customers are reporting being down even with A/B power. My understanding is grid power and backup power run into this ATS. The ATS is broken, so NO power is passing through it grid or backup. Am I mistaken? Why didn't you mirror it to another datacenter for disaster recovery?

Posted by boskone, 08-10-2011, 03:40 PM
We can't tell if our machines are powered or not as, like colo4.com itself until recently, there is no network route past the colo4 edge.

Posted by icoso, 08-10-2011, 03:42 PM
http://www.oncor.com/community/outages/# https://maps.oncor.com/summary.asp

Posted by Formas, 08-10-2011, 03:43 PM
LOL. Sorry for the LOL, Paul, but I read this same post at https://accounts.colo4.com/status/ about an hour ago. Seems that an hour has passed and nothing was done.

Posted by Dedicatedone, 08-10-2011, 03:44 PM
We get it, everybody is angry. I, like the majority of these other people following this thread are following it to get updates from Colo4, not to read about your frustration. It's business, you can only hope for the best and plan for the worst. We are currently working on lighting up another facility in Toronto to provide data center redundancy for our cloud clients. I wish we already had this in place, but it happens, welcome to the tech world. We're all in the same boat here, but let's please keep the posts to a minimum so we can concentrate on updates from Colo4. I hope everything is back up as soon as possible for all of us.

Posted by teh_lorax, 08-10-2011, 03:45 PM
It's likely due to scenarios like this:

Posted by MetworkEnterprises, 08-10-2011, 03:45 PM
And now it seems that though A/B servers would have power if they opted and paid for A/B, colo4's own routers are not using A/B so A/B customers would still... well... be down. Epic.

Posted by Ed-Freethought, 08-10-2011, 03:46 PM
That's correct, the ATS acts as an automated switchover between the grid/utility/mains (whatever you want to call it) power and the backup power from the generator(s). It sits in front of the UPS, and if the power on one input fails (or is no longer providing the right voltage and frequency etc., such as in a brownout) then it disconnects that input and switches over to the other input. It has a couple of slightly complicated things to do - signal the auto-start panel for the generator, wait for the generator power output to stabilise, and then make sure that one power input is completely disengaged before the other is engaged so that you don't short out hundreds of kilowatts or even several megawatts from two different power sources!
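
To make the switchover sequence described above concrete, here is a minimal Python sketch of break-before-make transfer logic. It is purely illustrative: the object names, methods and thresholds are hypothetical, not Colo4's (or any vendor's) actual controller firmware, which performs far more extensive voltage, frequency and phase checks in hardware.

```python
import time

NOMINAL_VOLTS = 480        # assumed nominal service voltage (illustrative)
TOLERANCE = 0.10           # accept +/- 10% of nominal (illustrative threshold)
GEN_STABILIZE_SECS = 30    # let the generator spin up and settle


def source_healthy(read_volts):
    """True if the measured voltage is within tolerance of nominal."""
    return abs(read_volts() - NOMINAL_VOLTS) <= NOMINAL_VOLTS * TOLERANCE


def transfer_to_generator(utility, generator, load):
    """Hypothetical break-before-make transfer from utility to generator."""
    generator.auto_start()                       # 1. signal the auto-start panel

    stable_since = None                          # 2. wait for stable generator output
    while True:
        if source_healthy(generator.read_volts):
            stable_since = stable_since or time.time()
            if time.time() - stable_since >= GEN_STABILIZE_SECS:
                break
        else:
            stable_since = None                  # output dipped; restart the timer
        time.sleep(1)

    load.disconnect(utility)                     # 3. fully disengage one source...
    while load.is_connected(utility):
        time.sleep(0.1)
    load.connect(generator)                      # 4. ...before engaging the other
```

The point of the sketch is the ordering: start and qualify the new source first, then open the old contactor completely before closing the new one, so the two supplies are never paralleled.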

Posted by xolotl, 08-10-2011, 03:47 PM
Edit: Following Dedicatedone's lead, snarkiness removed... Last edited by xolotl; 08-10-2011 at 03:53 PM.

Posted by boskone, 08-10-2011, 03:48 PM
The colo4 network devices at the edge / border and core of our range are offline / unreachable. Packets aren't getting anywhere near our racks.

Posted by teh_lorax, 08-10-2011, 03:49 PM
Yeah, I'm sure they're all sitting around playing Sporcle and laughing.

Posted by boskone, 08-10-2011, 03:51 PM
Power issues at this scale are both complex and lethal. Everyone needs to take a deep breath and remember that colo4 are the experts, have everyone they need onsite and will fix this.

Posted by sunpost, 08-10-2011, 03:52 PM
What is preventing you from by-passing the ATS, rather than taking the time to install a backup ATS?

Posted by UH-Matt, 08-10-2011, 03:53 PM
Could we please get this estimate, even if it's a very rough estimate... so we have something to aim for? Is this likely to be 30 minutes, 300 minutes or 3000 minutes away?

Posted by boskone, 08-10-2011, 03:53 PM
The ATS can't really be bypassed without endangering the systems.

Posted by RDx321, 08-10-2011, 03:56 PM
Any updates on the ETA?

Posted by andryus, 08-10-2011, 03:57 PM
4 hours of downtime. Please, we need an ETA!

Posted by Xtrato, 08-10-2011, 04:00 PM
So they have :

Posted by mindnetcombr, 08-10-2011, 04:00 PM
I bet 4 or 5 more hours of downtime, easy. I have 23 servers offline.

Posted by Eleven2 Hosting, 08-10-2011, 04:00 PM
Why are they just putting us up for a short time and then going to take it back offline again? This makes no sense. Make the final fix now and don't create a second outage.

Posted by Dedicatedone, 08-10-2011, 04:02 PM
They don't have the equipment right now. Put something up now and get everybody online then work on a permanent solution when you have a proper plan in place, not while you're in an emergency situation.

Posted by user5151, 08-10-2011, 04:03 PM
I'm glad we're not the only ones that see this... we've posted this twice in the forum to Paul, and to our colo4 ticket... 65.99.244.225 is a colo4 HSRP router in front of our firewalls and equipment in colo4, and that colo4 HSRP isn't responding to a ping. So it appears that at least this colo4 HSRP router, which is part of their premium managed services, isn't on their own A/B solution either.

Also... to clarify some confusion...

utilityA >
           ATS-A ---- your equipment
generator

utilityB >
           ATS-B ---- your equipment
generator

This is what colo4's been referencing... the building that is affected has two "service entrances"... two points at which they deliver power into the affected building (their other building, with 4 utility entrance points, is not affected). If you pay for A/B service, you have 2 circuits to your rack that are serviced separately by these A/B service entrances. If you're not paying for A/B service, you may still have 2 circuits at your rack... but they will trace back to the same service entrance... so with that service entrance's ATS down, you're completely down.

My problem is that even with A/B, it doesn't matter if the HSRP router (managed by colo4) isn't online either... my equipment could be powered up on the B power circuit, but you can't reach it, because colo4's HSRP router is down... presumably because IT isn't on an A/B service. Still no answer from them on this... but I'm apparently not the only colo4 customer who sees this same scenario, based on similar updates here.

Also - colo4 just updated their site:
----------------------------------------
Current Update

Thank you for your patience as we work to address the ATS issue with our #2 service entrance. We apologize for the situation and are working as quickly as possible to restore service.

We have determined that the repairs for the ATS will take more time than anticipated, so we are putting into service a backup ATS that we have on-site as part of our emergency recovery plan. We are working with our power team to safely bring the replacement ATS into operation. We will update you as soon as we have an estimated time that the replacement ATS will be online.

Later, once we have repaired the main ATS, we will schedule an update window to transition from the temporary power solution. We will provide advance notice and timelines to minimize any disruption to your business.

Again, we apologize for the loss of connectivity and impact to your business. We are working diligently to get things back online for our customers. Please expect another update within the hour.

Posted by Xtrato, 08-10-2011, 04:04 PM
I think having it fixed right now would extend the outage... They are putting into place a temporary ATS so that it will bring all the servers back online. Perhaps they will schedule a maintenance window during slow traffic hours; peak hours like now are very, very inconvenient for everybody, especially someone like mindnetcombr who has 23 servers offline...

Posted by SH-Sam, 08-10-2011, 04:07 PM
9 servers offline for me right now- got a flood of support tickets! Hopefully this will be resolved soon!

Posted by wynnyelle, 08-10-2011, 04:08 PM
They only have one power line?

Posted by user5151, 08-10-2011, 04:13 PM
No, they have 6 power lines... 2 into this building, 4 into their other building (which isn't affected). In the building that's affected, the ATS for ONE of those power connections (service entrances) has failed. So everyone whose racks have power from that connection is down, UNLESS they paid for an apparently optional/upgrade service to have 2 circuits at their rack, serviced SEPARATELY by the two separate power connections (service entrances). Service entrance = where the power company comes into the building.

Posted by Ed-Freethought, 08-10-2011, 04:14 PM
In this case I was referring to the "power line" as the A-side supply in the Colo4 facility, as jumpoint was calling it a "line". There are multiple power lines from the utility company to the facility itself. That isn't the issue here.

Posted by wynnyelle, 08-10-2011, 04:15 PM
Then they need to switch to the power lines that are working.

Posted by media r, 08-10-2011, 04:21 PM
Paul, could we please get another update? I'd just like something to send my customers to let them know progress is still being made. A simple status update would be just fine.

Posted by Ed-Freethought, 08-10-2011, 04:21 PM
You can't just "switch" a megawatt of power around like that. The supporting power infrastructure is large and complex and not something that should be played around with on a whim without any proper planning, never mind the fact that working with this level of electricity means one slip and you're dead. Customers that paid for full A+B redundancy still have power from the B side. Customers that opted to only take a single power feed from the A side will have to wait until the problem with the A side is repaired.

Posted by wynnyelle, 08-10-2011, 04:26 PM
Well I guess that explains how my host screwed up. They made a bad decision that's going to give me some serious food for thought.

Posted by boskone, 08-10-2011, 04:27 PM
That's false info. Lots of hosts with A/B are down in the affected DC. Colo4's own core network is also down (which one would imagine uses A+B power).

Posted by StevenMoir, 08-10-2011, 04:27 PM
Servers have been down for more than 6 hours. Our customers are getting agitated. Is this the beginning of the end for all of US and COLO4? Steve

Posted by soniceffect, 08-10-2011, 04:29 PM
Just saw a Twitter post from someone who just got off the phone with Colo4. Provisional ETA of 1800 CST (around 2.5 hours)... Ouch.

Posted by Garfieldcat5, 08-10-2011, 04:38 PM
I'd think twice about that, they have a pretty stiff ETF.

Posted by DomineauX, 08-10-2011, 04:38 PM
Should only be 4.5 hours down so far.

Posted by bear, 08-10-2011, 04:40 PM
Yup, our bells started ringing about 12:03 Eastern. 4:40 now.

Posted by wynnyelle, 08-10-2011, 04:44 PM
Doesn't matter, if they offered a solution and my host didn't take that solution then my host had no place advertising themselves to me as having any sort of proof against this kind of thing. Someone's to blame. I wish I really knew who so I could form a plan of action.

Posted by iwod, 08-10-2011, 04:45 PM
You make your FIRST post since joining in 2005... Welcome. I think most people would rather have Colo4 lie to them and say they need 8 hours to get it repaired than know nothing. Although from a client/customer perspective, 8 hours of fixing and not telling them anything makes no difference... Those who leave will leave...

Posted by wynnyelle, 08-10-2011, 04:47 PM
Around noon is when we went down. It was shaping up to be a good, brisk day on my site too. It's become the year's worst disaster.

Posted by Patrick, 08-10-2011, 04:47 PM
Most people will stay. Let's be realistic here, it's not like Colo4 has power outages every month. Yes, there have been a few DDoS attacks in the last couple of months that affected network stability but hardly anything to worry about. People panic and freak out when they start losing money, understandable, and when it comes back online most people will move on and put this behind them.

Posted by iwod, 08-10-2011, 04:48 PM
I think people should all just log into WHT; we get information here much faster.

Posted by FideiCrux, 08-10-2011, 04:49 PM
I find it interesting that people are knocking the data center for its redundancy. Complaints about clients leaving and businesses failing because of not being able to receive e-mails and such. Where are your redundancy plans? Contingencies need to be planned for businesses as well, not just data centers. Need your e-mail just in case your data center gets DDoSed or the power goes out? Set up a backup domain. Just as one can't plan for the weather and may not make it into work when it snows (thus not getting paid), plan for disasters. -.-

Posted by JDonovan, 08-10-2011, 04:50 PM
We have A/B power on multiple servers and they are all still down. We are in the older facility. Even if we didn't have this addon, Colo4's statement regarding this was a poor excuse. We will be leaving Colo4 after this fiasco. There is no way an outage should last this long from a power failure, that's why you have backup plans. Acts of God are understandable but not this. There are some major clients who are affected by this right now. An example is Radiant Systems. Tens of thousands of restaurants have major parts of their systems down. Other major clients are experiencing the same thing.

Posted by Patrick, 08-10-2011, 04:50 PM
Take your logic and go elsewhere! This is WHT, logic often gets thrown out the window.

Posted by soniceffect, 08-10-2011, 04:50 PM
Mine is back up!

Posted by layer0, 08-10-2011, 04:50 PM
If a few hours of downtime is the year's worst disaster for you, then you really need to look into investing in geographic redundancy. To people that want to move - you can keep moving from provider to provider every time there's an outage, it's really not going to do you much good. An event like this is a rare occurrence at Colo4 and not something I'd pack up and leave over.

Posted by FideiCrux, 08-10-2011, 04:53 PM
WTB Logic!!!

Posted by ASIMichael, 08-10-2011, 04:55 PM
We have been a customer of Colo4 for over 4 yrs now. We are HAPPY. This was a mechanical electric switch (ATS) failure. If customers have A-B service feeds wired electrically to their cabinets AND wired correctly to the servers (dual power supplies), they are working. I DO. I paid for it. Folks don't think of WHAT IF. asimb

Posted by wynnyelle, 08-10-2011, 04:55 PM
Now I'm wondering if it's my host or Colo4 who is at fault in this. I just want the truth.

Posted by Patrick, 08-10-2011, 04:57 PM
What do you mean? It's Colo4's fault... your host doesn't control the power infrastructure there. Even if they paid for A/B dual power feeds, it doesn't mean they would have service by now... there are some people who have dual feeds that are offline.

Posted by FideiCrux, 08-10-2011, 04:57 PM
Wikipedia on High Availability... Please note the following: "Availability is usually expressed as a percentage of uptime in a given year."
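
For anyone who wants to turn those availability percentages into something tangible, here is a quick back-of-the-envelope calculation (plain Python, assuming a 365-day year and treating the SLA as a simple uptime fraction):

```python
# Allowed downtime per year for common availability tiers (365-day year assumed).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for availability in (99.0, 99.9, 99.99, 99.999):
    allowed_downtime = MINUTES_PER_YEAR * (1 - availability / 100)
    print(f"{availability:>7}% uptime allows about {allowed_downtime:8.1f} minutes of downtime per year")
```

A 99.99% tier allows roughly 53 minutes per year, so a multi-hour outage like this one blows through almost any SLA immediately; as noted earlier in the thread, the "guarantee" then only determines the size of the credit, not whether downtime can happen.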

Posted by wynnyelle, 08-10-2011, 04:58 PM
That's right, I was going on what someone said before that if my or any client had paid for A/B they would not be suffering this loss right now. Come to find out that isn't true.

Posted by nightmarecolo4, 08-10-2011, 05:01 PM
Paul -- do you think we will be back online within the next 2 hours? Can you ask the engineers?

Posted by wynnyelle, 08-10-2011, 05:02 PM
Any word yet? We're waiting. And I'm moving past where I can keep waiting. I have to get my site up, however I do that.

Posted by MtnDude, 08-10-2011, 05:02 PM
On a web server serving large amounts of static data, what is the best way to achieve geographical redundancy? Is Round robin DNS the way to go? What other alternatives exist? A load balancer would still be a single point of failure.

Posted by bostongio, 08-10-2011, 05:05 PM
Use a CMS like Wordpress that can host cached content on a cloud service like AWS. Very inexpensive and works automatically.

Posted by xnpu, 08-10-2011, 05:05 PM
I think you better open a separate topic for that.

Posted by cartika-andrew, 08-10-2011, 05:05 PM
I am not going to leave a provider over this - these things happen. We have been dealing with colo4 for years and have had good results for a long time. Stuff happens. Having said this, I am a little upset at Paul over his comments here. This power outage is clearly something colo4 needs to deal with. Insinuating that customers are at fault for not having A+B power feeds is not reasonable. Firstly, Colo4's rates for A+B are not really in line with other facilities we work with. Secondly, we have A+B on some of our infrastructure with colo4 - and some of it is up, yes - but some of it is also down. What this means is that some PDUs are serviced from the same power plant - so even though we have A+B protection on parts of our infrastructure, the feeds are coming from different PDUs (but the same power plant) - and we are still down. I just do not think it is appropriate to suggest that this may be colo4's customers' fault because they don't have A+B. This is something that should have been discussed with your customers - not publicly. I am now answering questions about where we have A+B and where we don't and why - and frankly, the issue is colo4 lost a power plant... let's try and remember that.

Posted by jawwad kalia, 08-10-2011, 05:05 PM
This downtime is driving people crazy; have a look at the reviews on the Google Maps link. http://maps.google.com/maps/place?cl...=0CCAQnwIoATAA

Webmaster - Aug 10, 2011: Down right now! 4 hours and counting. Unacceptable! No ETA!!! Owner emails me telling me they sell a backup power plan and we should have bought that. Wow. Run away - far away. Nice facility and good marketing spin, but horrible management and don't care attitude.

Posted by wynnyelle, 08-10-2011, 05:05 PM
5+ hours now. I still have no website. Hello Colo4, any word?

Posted by upstart, 08-10-2011, 05:06 PM
It is what it is. They claimed redundancy and uptime stats that are simply not believable any longer. Not saying good not saying bad. Sh** happens. By comparison our other data center in ATL had an airplane crash directly into a generator, transformers and 40' into the 1st floor. Lost main power feed and two primary carriers. Two weeks to replace gen, 14 hours for power restoration and 8 hours for fiber reconnect. DOWNTIME = ZERO. When a DC actually has redundancy, it's a beautiful thing.

Posted by layer0, 08-10-2011, 05:06 PM
Use a DNS provider like www.dnsmadeeasy.com who can handle either round robin DNS or strictly failover DNS. If it's mostly static content you can use something as simple as rsync to keep data synchronized between two different servers.
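
As a rough illustration of the rsync approach layer0 mentions, a cron-driven script along these lines keeps a standby server's document root in sync. The hostname and paths are placeholders, not real systems, and the DNS failover itself (health checks pointing traffic at the standby) is configured separately at your DNS provider:

```python
#!/usr/bin/env python3
"""Minimal sketch: mirror a static document root to a standby server with rsync."""
import subprocess
import sys

SOURCE = "/var/www/html/"                            # trailing slash: copy contents
DEST = "deploy@standby.example.com:/var/www/html/"   # placeholder standby host


def sync_static_content() -> None:
    result = subprocess.run(
        ["rsync", "-az", "--delete", SOURCE, DEST],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface the error so cron mail or monitoring notices the failed sync.
        print(result.stderr, file=sys.stderr)
        sys.exit(result.returncode)


if __name__ == "__main__":
    sync_static_content()
```

Run from cron every few minutes, the standby always holds a near-current copy of the static content, and a failover DNS record can direct visitors there when the primary stops answering health checks.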

Posted by advwebsys, 08-10-2011, 05:07 PM
We've been at colo4 since December 2008, so I assume we are in the 'old' facility. We've got 11 servers in the rack, and at about 12:20 this afternoon, nine were back online. Two are still down. When I ssh'd into our main shared server, I noticed that the last boot date had not changed. This implies that those nine machines did NOT lose power. We have two separate circuits into the rack, which I do NOT believe are A/B redundant; at least that's not what we ordered, nor are we using them that way. However, I SUSPECT one circuit is on the A feed and the other on the B feed, but I have no idea. Nor do I remember if both down boxes are on the same circuit (DUH!). Is the problem that ***I'M*** seeing a routing issue, since I can't get a traceroute to either down box? Even though all the machines in the rack are on the same class C net? Perhaps someone smarter than me can enlighten me.

Posted by teh_lorax, 08-10-2011, 05:07 PM
*might not be true Apparently some people are still working. Things will get sorted out post mortem. There is really no need to jump to conclusions.

Posted by voipcarrier, 08-10-2011, 05:10 PM
It looks like they upgraded their webserver. It briefly said "It works!" and now it's back online. Ubersmith was offline too.

Posted by old_admin, 08-10-2011, 05:11 PM
Not sure about anyone else, but our systems just became accessible again about 20 minutes ago.

Posted by rezilient, 08-10-2011, 05:13 PM
My KnownHost VPS is still down... Have the KH folks posted in here yet?

Posted by arthur8, 08-10-2011, 05:13 PM
Good question; I think round robin is the best solution.

Posted by jawwad kalia, 08-10-2011, 05:15 PM
They have nothing much to say, they are waiting for the issue to resolve...

Posted by nemonoman, 08-10-2011, 05:18 PM
I'm with TMS. Can somebody please help me clarify -- are my servers not powered up? Or is the network to my servers down? Or both? Thanks.

Posted by wynnyelle, 08-10-2011, 05:19 PM
They're all just sitting and waiting? Someone somewhere has to be working on this.

Posted by Tobarja, 08-10-2011, 05:23 PM
Get a grip dude. The reply was in regards to a specific colo4 reseller. "They"(the reseller) are doing what everyone else in this thread is doing: Fighting fires, talking to customers, and making sure their own crap is together. There's nothing anyone who is not colo4, a power contractor, or anyone not currently standing in the colo4 buildings can do to help. So, if the customers have been placated, we sit, we watch pings, we wait.

Posted by mcianfarani, 08-10-2011, 05:23 PM
What do you expect them to do? Nothing can be done until Colo4 fixes the power issues. At that time your host will work on getting everything operational.

Posted by LizQ, 08-10-2011, 05:26 PM
I would expect they are working on it frantically and are overstressed already. Possibly too busy working to be posting on forums?

Posted by Hoopla-Brad, 08-10-2011, 05:27 PM
Last I heard they were taking a break at the pub. Would you like us to start a candlelight vigil?

Posted by leoncariz, 08-10-2011, 05:31 PM
For people seeking updates but don't have time to chase forum posts; twitter.com/#!/search/colo4 might be a good way to get updates delivered via Tweetdeck or twitter website.

Posted by wynnyelle, 08-10-2011, 05:31 PM
Only hitting refresh on my site. That's the only update that's going to matter in the end.

Posted by brendanm, 08-10-2011, 05:33 PM
I was actually going to ask this same question. Since ATSs can be a single point of failure for an individual electrical bus, they do make an ATS with a bypass feature, allowing you to force the source to either utility or generator manually. Then you can actually roll out the switch itself, repair it, then put it all back together. Does anyone know if they use BYPASS transfer switches?

Posted by bruuuuuce, 08-10-2011, 05:34 PM
There are some servers working -- I have a partner with a dedicated server and it hasn't even burped; I assume it's in the other, unaffected building. However, my VPS is down -- it sucks, yes, but KH couldn't prevent this; probably no one could. I've been with KnownHost for 5 years now, so I won't be leaving them for something that's not their fault.

Posted by zdom, 08-10-2011, 05:34 PM
They are back (colo4.com); we are not!

Posted by wynnyelle, 08-10-2011, 05:35 PM
I still don't see my site.

Posted by (Stephen), 08-10-2011, 05:36 PM
I don't know if they do, but I know that on many ATS switches the failsafe mode sends everything to generator, and the facility runs on that while the ATS is swapped out/fixed. This happened at our new facility just in the last 6 weeks. It was about the time of the last storms; power was going on and off every 30/45 minutes very violently, and that was just long enough for the ATS to have to make a decision on staying on generator or switching back to utility. In the end the ATS died, and it went to the failsafe generator for 8 hours while alerts of the ATS failure went out, and the ATS was fixed. There has not been much rain in the DFW area since, but there has been very high heat and a lot of power usage, resulting in strain on the grid and quite a few brownouts and even some minor rolling blackouts, enough to trigger the generator to run sometimes an hour a day. Last edited by (Stephen); 08-10-2011 at 05:40 PM.

Posted by nightmarecolo4, 08-10-2011, 05:38 PM
It is possible the bypass was screwed up too. Who knows. I would have checked the phase rotation, run the gen lines to the line side of the UPS breaker, started the generator, then closed the UPS breaker. I am a bit of a cowboy though (I have done that before with 750 kW). Actually, the gen lines should go to the bus of the main switch.

Posted by djlurch01, 08-10-2011, 05:40 PM
Why is the colo4.com site loading so slowly? It only contains very basic code with one graphic. Is everybody refreshing all at once? Their last update was 1.5 hours ago. I would like updates at least every 30 minutes. I'm having an extremely tough time because I am sitting on the trade show floor pushing my web app...with 3/4 of my servers down. I have been up front with my customers. I've sent out 2 mass emails so far and posted about 20 updates on my Facebook fan page. If you think it's ugly now, wait until tomorrow when we have to answer: what are you going to do to fix this?

Posted by brendanm, 08-10-2011, 05:44 PM
Yeah, it depends on the switch and the logic built into it. Mine, once on generator, waits for the utility source to come back, then starts a 30-minute countdown to make sure it's stable.

Posted by wynnyelle, 08-10-2011, 05:44 PM
I know I'm going to have to go out on a limb to make this up to my site's users. But I'll do it. There was never a question in my mind of whether I would.

Posted by pvanmeter, 08-10-2011, 05:50 PM
Latest update just released. Our team and electricians are working diligently to get the temporary ATS installed, wired and tested to allow power to be restored. As the ATS involves high-voltage power, we are following the necessary steps to ensure the safety of our personnel and your equipment housed in our facility. Based on current progress the electricians expect to start powering the equipment back on between 6:15 – 7:00pm Central. This is our best estimate currently. We have tested thoroughly and don't anticipate any issues in powering up, but there is always the potential for unforeseen issues that could affect the ETA, so we will keep you posted as we get progress reports. Our UPS vendor has checked every UPS, the HVAC vendor has checked every unit, and no issues were found. Our electrical contractor has also checked everything. We realize how challenging and frustrating it has been to not have an ETA for you or your customers, but we wanted to ensure we shared accurate and realistic information. We are working as fast as possible to get our customers back online and to ensure it is done safely and accurately. We will provide an update again within the hour. While the team is working on the fix, I've answered some of the questions or comments that have been raised:
1. ATSs are pieces of equipment and can fail as equipment sometimes does, which is why we do 2N power in the facility in case the worst-case scenario happens.
2. There is no problem with the electrical grid in Dallas, and the heat in Dallas did not cause the issue.
3. Our website and one switch were connected to two PDUs, but ultimately the same service entrance. This was a mistake that has been corrected.
4. Bypassing an ATS is not a simple fix, like putting on jumper cables. It is detailed and hard work. Given the size and power of the ATS, the safety of our people and our contractors must remain the highest priority.
5. Our guys are working hard. While we all prepare for emergencies, it is still quite difficult when one is in effect. We could have done a better job keeping you informed. We know our customers are also stressed.
6. The ATS could be repaired, but we have already made the decision to order a replacement. This is certainly not the cheapest route to take, but it is the best solution for long-term stability.
7. While the solution we have implemented is technically a temporary fix, we are taking great care and wiring it as if it were permanent.
8. Colo4 does have A/B power for our routing gear. We identified one switch that was connected to A only, which was a mistake. It was quickly corrected earlier today but did affect service for a few customers.
9. Some customers with A/B power had overloaded their circuits, which is a separate, individual issue rather than a network issue. (For example, if we offer A/B 20 amp feeds and the customer has 12 amps on each, then if one trips, the other will not be able to handle the load.)
As you can imagine, this is the top priority for everyone in our facility. We will provide an update as quickly as possible.
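
To make the arithmetic behind point 9 concrete, here is a minimal sketch using the illustrative 20-amp figures from the update above (not anyone's actual circuit ratings): for an A/B pair to survive the loss of one side, the combined steady-state draw has to fit on a single feed.
$ echo $((12 + 12))   # load that lands on the surviving feed if one side drops
24
A 24-amp load on a 20-amp breaker trips it, so a pair drawing 12 amps per side is not genuinely redundant; the combined draw across both feeds would need to stay at or below 20 amps, and lower still once continuous-load derating is taken into account.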

Posted by gbcactus, 08-10-2011, 05:57 PM
I think they're flying in a new Flux Capacitor from Taiwan.

Posted by jimster4mex, 08-10-2011, 06:01 PM
Now that is some good news; give us the news hard, give us the news fast. I especially appreciate the fact that Colo4 is accepting fault for several engineering issues, and I am sure that over the next several days a team will discover more small problems that can be fixed in order to prevent a similar occurrence. One suggestion would be to host an emergency response site elsewhere... and to tweet every 15 or 30 minutes in order to take the pressure off of those of us who are at the mercy of our client bases. Keep at it guys, don't panic and make things worse.

Posted by neumannu47, 08-10-2011, 06:03 PM
Everyone with clients climbing your butts needs to put together a package for redundancy that you sell to them at a very good profit. Even then, clients need to understand that all servers fail. I wonder how many clients complaining about downtime are paying $3 per month for the service. I hate being down, too.

Posted by Formas, 08-10-2011, 06:05 PM
It seems that Colo4 coexists with a lot of "small" mistakes. This is deeply regrettable.

Posted by zdom, 08-10-2011, 06:08 PM
Seems like colo4 is a huge mistake.

Posted by BFDROW, 08-10-2011, 06:10 PM
From Colo4's About Us page: Guess we know who dropped the ball here.

Posted by teh_lorax, 08-10-2011, 06:12 PM
Except that you almost never have any good info until near the end of something like this. During a crisis there is usually nothing but panic and DISinformation. It certainly wouldn't have helped us to be panicking along with them.

Posted by advwebsys, 08-10-2011, 06:14 PM
Be an adult. You probably weren't colo'ed at ThePlanet when they had UPS burnouts and fried a huge number of servers - including 5 of mine. Thus far, I haven't heard anyone who's come back up saying they lost data. I certainly didn't have that problem. Life intervenes.

Posted by pvanmeter, 08-10-2011, 06:14 PM
As for the ATS questions, the ATS had a failure. It was prevented from operating correctly. While we all like to think "cowboy" and just start re-wiring and hoping, at these loads that is dangerous to our people and our customers’ equipment.

Posted by muppethunter, 08-10-2011, 06:18 PM
If nothing was/is wrong with the Dallas grid as stated in your earlier post then why did the ATS need to switch in the first place?

Posted by (Stephen), 08-10-2011, 06:21 PM
I am not trying to rag on you at all here, but just for info's sake, there most certainly are ATS designs/systems at very high load levels that offer manual bypass systems for maintenance, etc. Many are outside the ATS itself so as to allow for full maintenance of the ATS system: http://www.eaton.com/Electrical/USA/...ches/index.htm This is just one; there are many more industrial products like this. Is it not in place there, or did it fail also?

Posted by djlurch, 08-10-2011, 06:22 PM
Good question. A vast majority of us are still down, so of course this is the case. We won't find that out until the main power grid lights back up. This is one of my primary concerns right now. I KNEW they would eventually fix the power.

Posted by Eleven2 Hosting, 08-10-2011, 06:26 PM
I just find it stupid that they are fixing it today after this very long outage and then planning to take it offline again to replace it. Each time it goes offline, countless servers are going to go unrecoverable and there is just way more work involved.

Posted by nightmarecolo4, 08-10-2011, 06:29 PM
During a planned outage it does not need to go down. The UPS should be able to hold the load while they switch it.

Posted by muppethunter, 08-10-2011, 06:32 PM
Keyword "should". There were a lot of things that "should" have happened today but they didn't.

Posted by wynnyelle, 08-10-2011, 06:32 PM
Oh okay, well this downtime is still ongoing.

Posted by pvanmeter, 08-10-2011, 06:36 PM
I wasn't here when it happened, but my understanding is there was a blip of the lights not even detected by the UPSs on other service entrances. So it may have been a minor "blip," or the blip seen on the utility could have been from the failure of the ATS. Rodney, we are installing the temporary ATS to restore power as rapidly as possible. It will be weeks before the new unit is in place. Once that is in place, tested and load banked, we will set up a window for going to the permanent solution. Even with the UPS there might still need to be a short period when this entrance is down.

Posted by wynnyelle, 08-10-2011, 06:39 PM
So there will be more downtime? Thank you for the updates.

Posted by Garfieldcat5, 08-10-2011, 06:43 PM
The SLA mentioned in contracts says downtime of 16 hours in a 3 month period would allow people to exit the contract without penalty. Would Colo4 be willing to make an exception in this case and let people out if they wanted to leave?

Posted by pvanmeter, 08-10-2011, 06:43 PM
we just got another update. 6:15 looking even better and we may even beat that. Power has been restored to the distribution gear from the temporary ATS. HVAC units are now all online, and we will be beginning the process of restoring power to UPSs soon, then PDUs, and then customer equipment. We will update you as the other areas come online. Thank you again for your patience.

Posted by NeoComp, 08-10-2011, 06:45 PM
The Truth Behind Redundant Data Center Power, published by the president and CEO of Colo4

Posted by An0nym0usH0st, 08-10-2011, 06:46 PM
I am just simply astounded by the way some of you are acting on here. You are making yourselves look like purely unprofessional business owners/personnel.* Before you start asking the question "Where's the redundancy at Colo4?", you need to ask YOURSELF that FIRST. If you are true business owners you should have sat down with your partners and thought of several strategies.*
1. Disaster Recovery - What if the building catches fire and your stuff is the one affected? Are you ready to recover the data? Or are you going to blame your hosting provider for your lost data? Always pointing the finger at someone else so you don't have to catch the brunt of things and it makes you look better. Which leads to the next thing: R-e-d-u-n-d-a-n-c-y.
2. Redundancy - Yeah, yeah... you can say that your provider should provide "dual this, dual that, etc." But there's your first step: relying heavily upon your provider for everything so you have someone to blame for your mistakes. Get a business loan, buy a few extra servers, mirror what you have, so that if your provider does happen to go completely down you're ready to make the transition to backup solutions.*
3. Mother Nature - Humans make errors every minute, every hour, every day. No single person is "perfect." Electricity is not a completely perfect source of energy. Yeah, you can say "well, if they had generators that should keep it running." Wrong. Generators have operating conditions and temperature ranges within which they can properly run. Once that threshold is hit you start running into overheating issues, which makes certain components fail.*
There's a saying out there that some of you small business owners should remember: "In order to make money, you're going to need to spend money." Google has had numerous outages, some small, some big. Sometimes you don't even realize it because they have redundant/backup facilities to keep things running. The State Farm data center in Las Colinas has 6 tugboat generators supporting the facility. MASSIVE generators. They have had a couple of outages. But you never know this because they have 2 other COMPLETELY MIRRORED facilities in the United States. One site goes down, the other site can carry the load. All I am asking is that you business owners being affected by Colo4's VERY UNUSUAL outage today sit down, rethink your strategies and quit pointing the finger at someone else for something YOU YOURSELF didn't think of before starting your business. And remember that no one is perfect and that faults happen.* Thank you, An0nym0usH0st

Posted by neumannu47, 08-10-2011, 06:47 PM
Why should they? They haven't been down for 16 hours yet.

Posted by neumannu47, 08-10-2011, 06:49 PM
Why do you have asterisks at the end of most paragraphs?

Posted by Garfieldcat5, 08-10-2011, 06:51 PM
Well, if you lost a bunch of customers over this and can't afford to continue... I'm not saying this is us; I was just asking the question.

Posted by wynnyelle, 08-10-2011, 06:51 PM
6:51 PM and I still have no site...

Posted by An0nym0usH0st, 08-10-2011, 06:53 PM
Haha I honestly am not too sure. I saw that after I posted it. Maybe because I'm on an iPhone? Weird. -An0nym0usH0st

Posted by neumannu47, 08-10-2011, 06:54 PM
That probably explains it!

Posted by djlurch, 08-10-2011, 06:56 PM
Anonymous: I have multiple servers and offsite backups. My mistake was having multiple servers (3/4) at one data center. You can be sure I am now in the market for another data center (not necessarily for those 3 servers, but for additional servers). I will also be looking at mirroring technologies for Windows. I think we have a valid complaint that redundant systems failed. I pay a bit of a premium to be at Colo4. If backup wasn't important I'd host at a truly budget colo. I think Colo4 has handled everything acceptably well. However, it is UNACCEPTABLE that they didn't use Twitter/Facebook to keep everyone up to date. I will be personally addressing this with them after things quiet down. They also needed more frequent updates. I sent two mass emails and posted 30+ times on my Facebook page with information. Their only "announcement" source (their site) was unreachable 50% of the time for me.

Posted by LizQ, 08-10-2011, 06:57 PM
Colo update: Power has been restored to the distribution gear from the temporary ATS. HVAC units are now all online, and we will be beginning the process of restoring power to UPSs soon, then PDUs, and then customer equipment. We will update you as the other areas come online. Thank you again for your patience.

Posted by neumannu47, 08-10-2011, 06:57 PM
If you lost a bunch of customers because of today's down time, you might not have adequately prepared them. A reseller cannot offer terms any better than the terms their hosting provider offers. If the host says 16 hours in 3 months, reseller customers need to be told the same. Unless you have some sort of elaborate fail-over system.

Posted by MetworkEnterprises, 08-10-2011, 06:58 PM
I think part of offering a service is standing up in front of your customers and taking the heat when the people you provide service to are let down. It's just how it works. Most of the people on here are going to lose customers. Most will be on the phone all day tomorrow being bitched at by one client after the next. People need to vent when they are upset. Paul knows that. I'm not sure he needs us all to offer our own versions of customer service to his clients.

Posted by Saberus Terras, 08-10-2011, 07:01 PM
Colo4 Dallas is on Central Daylight time. Still 15 more minutes before they hit 6:15

Posted by MtnDude, 08-10-2011, 07:08 PM
Great point. Can you please point us to good resources outlining techniques to do that? Say, like State Farm, you have a mirrored facility, do they use round robin DNS to ensure the standby facility takes over seamlessly?

Posted by wap35, 08-10-2011, 07:12 PM
I would suggest that KnownHost offer some sort of failover mechanism as an option. I imagine people would be willing to pay something like $10 to $20 extra for it - things like auto fail over from TX to CA.

Posted by An0nym0usH0st, 08-10-2011, 07:14 PM
Google has been very useful in helping me with questions and solutions that may arise.

Posted by BFDROW, 08-10-2011, 07:15 PM
This isn't easy or quick to fix, and I'm sure Colo4 will correct any shortcomings to make sure it doesn't happen again. One has to take a little responsibility as well. Our corporate email has been down all day, which has impacted our company greatly. Should I have a secondary email server in another part of the country? Having had several servers at Colo4 for over 7 years, I can simply say I've never seen an outage like this, and the few network interruptions have generally been brief. So do I want to spend the money to have a secondary server that might come into play every 7 years? No, not really; it doesn't make sense for us financially or any other way. Now, that's just a small example, and we have some hosted app servers for which we may look at doing some offsite redundancy just in case, if only for load balancing. Backups and redundancy, at the end of the day, are the responsibility of the business owner. The buck stops here. A week from now you won't even think twice about this.

Posted by advwebsys, 08-10-2011, 07:17 PM
Forgive me for saying this, but are y'all saying that this outage is going to make your customers leave? How many of y'all are running real-time trading desks or airline reservation systems? I know some of you are, and I sympathize. On our website and in conversations with prospective customers, we tell them that we are proud of our uptime, but we NEVER guarantee anything. Why? Because crap happens - see my previous posts - especially what happened to us at ThePlanet. Oh, and btw, how much are your customers willing to pay you for (a) redundant power supply servers connected to multiple circuits, (b) geographic redundancy and failover, and (c) off-site backup, when they can go to GoDaddy or 1&1 and get all that (plus 500 MySQL databases!) for $8.95 a month? And, of course, GoDaddy and 1&1 have never had an outage. So let's get real.

Posted by jumpoint, 08-10-2011, 07:19 PM
Actually, we expect to lose many customers. We have a proprietary solution and our average customer is well over $100/mo. That client will not tolerate this, and neither will we.

Posted by brentpresley, 08-10-2011, 07:22 PM
Absolutely. Enterprise customers might tolerate some SCHEDULED down time 1-2 times a year, and an occasional blip here and there for a min or two. But 6+ hours? No way. There will be bleeding from this, lots and lots of bleeding. And it will go all the way from the smallest shared customers up to big players hosted in Colo4.

Posted by mdrussell, 08-10-2011, 07:22 PM
Then you need redundancy between two diverse data centers. Power issues can affect anyone and until we know what happened with the ATS failure, it would be wrong to point the finger.

Posted by sosys, 08-10-2011, 07:23 PM
great... 6 hours downtime!!

Posted by wynnyelle, 08-10-2011, 07:23 PM
16 hours down time in 3 months is a hell of a lot.

Posted by media r, 08-10-2011, 07:24 PM
I've just seen a few of my servers come back online.

Posted by teh_lorax, 08-10-2011, 07:24 PM
What's the SLA in your Colo4 contract?

Posted by wynnyelle, 08-10-2011, 07:26 PM
Glad to hear some of you are getting service again. I know your pain..

Posted by advwebsys, 08-10-2011, 07:26 PM
Well, I understand. Are you saying that this kind of outage has such a deleterious effect on their businesses that they will go elsewhere? Anyway, not appropriate here. I'm not trying to create noise. We're all in the same boat.

Posted by jumpoint, 08-10-2011, 07:26 PM
You can make that point until the cows come home, but I am FULLY redundant within Colo4, and I pay them to make sure I have power, cooling, and internet. Now, 15 minutes here and there is one thing, but for what I pay them, is 6+ hours of downtime fair? Absolutely Not.

Posted by mdrussell, 08-10-2011, 07:27 PM
If you pay them for redundant power, uplinks and cooling then yes, you are absolutely right to be aggrieved. I would be too.

Posted by argv1900, 08-10-2011, 07:28 PM
My servers are now working. Yeah...:

Posted by jumpoint, 08-10-2011, 07:29 PM
Yes, it sure does. They pay us to make sure they're up, and we pay Colo4 to make sure we're up. Our equipment is fully redundant, unfortunately we did not anticipate their entire data center would go down for 6+ hours with no end in sight. That's simply unheard of in a professional co-location setting. And yes, we are sure in the same boat, a boat which I hope soon rights itself for both of us!

Posted by ServerZoo, 08-10-2011, 07:29 PM
This shows again that much so-called redundancy is just marketing talk....

Posted by MetworkEnterprises, 08-10-2011, 07:30 PM
SLA or not, an outage of this length through the main part of the day will see some attrition. That's just the facts. My point remains, however, that I'm not sure Colo4 needs random forum goers trying to appease/piss off/blame customers by telling them what they did wrong when they are paying Colo4 to avoid this exact type of issue. I would imagine that their customers are mad enough without folks here telling them they don't know what they are doing. Basically, your argument is that colo4 isn't to be trusted and the customer should have other plans. I don't think that's the message they want to convey. They are well aware that their customers are upset. Paul wouldn't have taken to a forum if he didn't expect to take some heat for the outage.

Posted by Tobarja, 08-10-2011, 07:30 PM
(emphasis mine) You should have been planning for Texas to go missing and still be in operation.

Posted by nightmarecolo4, 08-10-2011, 07:33 PM
Confirm-- my servers are coming back online

Posted by wynnyelle, 08-10-2011, 07:34 PM
I second this. And from now on I'm not trusting to have it just done for me. You want something done right you have to do it yourself.

Posted by sodapopinski, 08-10-2011, 07:36 PM
One server is up now and the other one is still down.

Posted by djlurch, 08-10-2011, 07:37 PM
Back up now. Look out non-dallas data centers. Here come some customers (probably not a flood of them). Besides losing customers, this will necessarily result in us implementing some sort of failover system. I'm going for a drink. It has been a LONG LONG DAY. Tomorrow will be worse.

Posted by MetworkEnterprises, 08-10-2011, 07:40 PM
Amen Brotha. Fortunately, I keep a bottle of Vodka by my desk.

Posted by Eleven2 Hosting, 08-10-2011, 07:41 PM
Everything is online for us.

Posted by pvanmeter, 08-10-2011, 07:42 PM
latest update: The power has been restored fully and all customers should be up. If you are a customer and have not come online yet, please open a help ticket for us to handle directly. In addition, we have deployed a team member to walk the data center and look for any cabinets not powered up. We will reach out to you and coordinate getting your equipment live for any that we observe in this process check. We will provide customers with a more detailed update upon completion of our after-action review for this incident. Our first goal at this time is to ensure everyone is up safely and all connectivity is restored. Thank you again for your patience.

Posted by wynnyelle, 08-10-2011, 07:45 PM
I'm not up at all. There is no site.

Posted by nightmarecolo4, 08-10-2011, 07:46 PM
we still have 2 servers down

Posted by zdom, 08-10-2011, 07:47 PM
Still offline

Posted by Tobarja, 08-10-2011, 07:47 PM
Are you a DIRECT customer of colo4? If yes, open ticket. If no, find your reseller, and BUG THEM.

Posted by MikeTrike, 08-10-2011, 07:52 PM
Holy crap, let there be light, I have power.

Posted by neumannu47, 08-10-2011, 07:55 PM
Our server has power, but when I open a web page, I get Bad Request (Invalid Hostname)

Posted by screwednow, 08-10-2011, 07:57 PM
Website still down (I'm a rimuhosting customer)

Posted by wynnyelle, 08-10-2011, 07:58 PM
I do have this disaster to thank for bringing some things to light for me.

Posted by tynman, 08-10-2011, 07:59 PM
Finally, my servers in Dallas are coming back online.

Posted by wynnyelle, 08-10-2011, 07:59 PM
Nothing here yet. Going most of a whole day with nothing; I'm going to be doing serious damage control and I'll be doing it with my wallet.

Posted by screwednow, 08-10-2011, 08:01 PM
You're not alone.

Posted by nightmarecolo4, 08-10-2011, 08:02 PM
It sucks. We still have 2 servers down.

Posted by wynnyelle, 08-10-2011, 08:05 PM
Still nothing. It's after 7; my time, it's after 8. Worst day in the site's history for at least this year, in terms of losses. This is the sort of thing you are willing to spend hundreds or thousands over time to prevent, because when it happens it can screw you up for weeks or months.

Posted by neumannu47, 08-10-2011, 08:11 PM
Have you opened a ticket with your reseller?

Posted by screwednow, 08-10-2011, 08:11 PM
Reseller's site is also down

Posted by Formas, 08-10-2011, 08:12 PM
Here the server answered ping for 15 minutes and now it is down again: === time out time out time out time out ===

Posted by generaldollar, 08-10-2011, 08:12 PM
I'm glad you know what we've gone through today.

Posted by wynnyelle, 08-10-2011, 08:13 PM
Yup, isn't it beautiful. At least my reseller has a Twitter account. Still down. Oh, and they're leaving Colo4 too.

Posted by MyNameIsKevin, 08-10-2011, 08:13 PM
Slowly starting to come up now. What a cluster haha

Posted by generaldollar, 08-10-2011, 08:16 PM
Yeah, our call volume increased like 1000 percent today... terrible Wednesday. No lunches... everyone stays late until we are back up, and it's still down... phone still ringin'.

Posted by teh_lorax, 08-10-2011, 08:17 PM
Okay, this is weird. Our TMS server never actually went down.

Posted by wynnyelle, 08-10-2011, 08:17 PM
I'm still down here!

Posted by andryus, 08-10-2011, 08:19 PM
Still one server down.

Posted by MtnDude, 08-10-2011, 08:19 PM
Still down.

Posted by wynnyelle, 08-10-2011, 08:22 PM
Colo4, what's up? Wasn't this supposed to be all online now?

Posted by layer0, 08-10-2011, 08:23 PM
For people who are still down, it should only be a brief matter of time before you're back now that power is restored in the facility. Your physical server may be undergoing fsck or similar right now. All of our systems in Colo4 are currently back online, most notably our primary nameserver - but there's a reason why we have two others in different facilities. For those of you that lost a lot from this outage, you should definitely look into some geographic redundancy. Colo4 has nothing to do with this now - it's all in the hands of your own provider to make sure their own infrastructure is online and functional now.

Posted by sodapopinski, 08-10-2011, 08:24 PM
My server is still down... but here's the update from their website: Current Update: The power has been restored fully and all customers should be up. If you are a customer and have not come online yet, please open a help ticket for us to handle directly. In addition, we have deployed a team member to walk the data center and look for any cabinets not powered up. We will reach out to you and coordinate getting your equipment live for any that we observe in this process check. We will provide customers with a more detailed update upon completion of our after-action review for this incident. Our first goal at this time is to ensure everyone is up safely and all connectivity is restored. Thank you again for your patience.

Posted by nightmarecolo4, 08-10-2011, 08:25 PM
We are gonna pay for this. Still 2 servers down; customers are starting to go nuts, and several have already changed name servers and gone elsewhere.

Posted by wynnyelle, 08-10-2011, 08:25 PM
Thanks--I'm taking this issue entirely back to my host now.

Posted by bruuuuuce, 08-10-2011, 08:28 PM
Remember folks ... posting "Still down" is about as stupid as posting "First". Also, anyone seeing that their VPS uptime did not reset -- the hypervisor was probably set to suspend your VM, and when the power came back up the hypervisor simply told it to resume. It's like it never went down at all. Also, if you have a VPS you may want to learn about what the V stands for...

Posted by screwednow, 08-10-2011, 08:31 PM
When colo4 says "all customers should be up" and your server is still down, it's not stupid by any means to post that your site is still down. Thanks for your input.

Posted by bruuuuuce, 08-10-2011, 08:35 PM
You are very welcome! And it is stupid, because they have NO CLUE who you are, your account number, which rack your server is in, which ROW your rack is in, etc. If you have any idea how big a data center is, and realize they wouldn't even know whether you had a physical server or a VIRTUAL server, you'd see they would be trying to find a needle in a needle stack.

Posted by solidhostingph, 08-10-2011, 08:39 PM
Arguing here will not solve any of your issues. Just wait.

Posted by upstart, 08-10-2011, 08:40 PM
Guess what...Still down.

Posted by FRCorey, 08-10-2011, 08:46 PM
Man, that plain sucks. Which facility is Colo4 in? I ask because I have had no issues with my Dallas name server, which is on a VPS with Linode. Still, here's a lesson folks: have A+B power, and make sure that the data center is RUN OFF THE UPS and the utility power CHARGES THE UPS. Otherwise you do not have much in the way of redundancy when a 20-dollar part fails in the ATS. A lot of companies skip this necessary step because it's expensive and takes up a ton of room for batteries, but in effect it's the only way you have truly redundant power. My colo is up in Denver, and now at least if I figure out which facility this is I won't expand there, since Dallas is 2nd on my list for expansion.

Posted by generaldollar, 08-10-2011, 08:47 PM
It's not stupid; some of us are employees and haven't necessarily ever contacted the data center directly. Maybe the higher-ups do that. I came to this forum for general information and to find out whether other users of this data center are experiencing the same issue. So it's awesome to know that someone else is down and it's not just us. Thanks.

Posted by layer0, 08-10-2011, 08:48 PM
Colo4 operates the facility that they're in.

Posted by nightmarecolo4, 08-10-2011, 08:50 PM
I have just found out that some of the PDUs are still down.

Posted by Patrick, 08-10-2011, 08:51 PM
Yeah, I just heard that too. *shrugs* hey layer0. <3

Posted by imteaz, 08-10-2011, 08:54 PM
My VPS with KnownHost came back up at about 7:50 EST.

Posted by FRCorey, 08-10-2011, 08:57 PM
I'm curious how Colo4 runs a building of that size off 4 generators at all.

Posted by media r, 08-10-2011, 09:09 PM
Anyone else not able to ping anything from their servers @ Colo4 presently?
$ ping google.com
ping: unknown host google.com
Just curious.

Posted by Patrick, 08-10-2011, 09:13 PM
Change your resolvers in /etc/resolv.conf to something like 4.2.2.2 and 4.2.2.1 for the time being...

Posted by solidhostingph, 08-10-2011, 09:16 PM
Can't do that now. My server is still down.

Posted by media r, 08-10-2011, 09:16 PM
Interesting, thanks. I guess I should probably run that by my tech, but what exactly will that do? I thought the resolvers always needed to be set to the ones provided by my provider.

Posted by Patrick, 08-10-2011, 09:21 PM
I'm guessing your resolvers are on servers that are still down within Colo4. The reason they recommend using the ones from your provider is that they are usually milliseconds away, and that makes a lot of difference when querying DNS records. Naturally you want to be able to query DNS records as fast as possible. Personally, I don't think it makes THAT much of a difference to use public resolvers (e.g. 4.2.2.2), but I suppose it's a matter of opinion and what the server is being used for. One thing you can do is put your provider's resolvers at the top of /etc/resolv.conf and the public ones at the bottom. If your provider's resolvers go down, your server will then query the public ones.
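
For anyone following that suggestion, here is a minimal sketch of what /etc/resolv.conf could look like with the provider's resolvers first and a public one as a fallback. The 203.0.113.x addresses are placeholders, not real Colo4 resolvers; substitute whatever your provider gave you.
# /etc/resolv.conf -- provider resolvers first, public fallback last
# (most Linux resolvers only use the first three nameserver lines)
nameserver 203.0.113.10
nameserver 203.0.113.11
nameserver 4.2.2.2
# move to the next entry after 2 seconds instead of the default 5
options timeout:2 attempts:2
The resolver works down the list in order, so day to day you still get the nearby resolvers; the public entry only matters when the first two stop answering.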

Posted by pvanmeter, 08-10-2011, 09:21 PM
If you heard that some PDUs are still down you got misinformation. Some individual power strips in cabinets tripped when all servers tried to come up at once, but the PDUs are up. Also, there are 6 generators operational. Here is the latest update: As of 6:30PM CDT all PDUs are online supplying power to our customers. The UPSs are currently in bypass mode while the batteries charge. During this time we are operating on generator power, as is the best practice. Once the UPS batteries are fully charged, we will migrate the PDUs to the UPSs and then put power back on the utility. This process will be short in duration and begin around 9:00PM CDT. We do not expect any issues or impact to customers tonight while these transitions happen. We also have extra staff that will remain on-site throughout the evening. We have technicians working with our customers to help bring any equipment online that did not come back on with the power restore. There are also a few customers who tripped breakers while the equipment was powering up, and we are working to reset those devices. We appreciate your patience as we continue to bring this issue to closure. If you encounter any problems with your equipment or access, please open a help ticket so that we may respond in the fastest manner. We will update customers with an official reason for outage (RFO) once we assess the reports that were generated today.

Posted by screwednow, 08-10-2011, 09:24 PM
I'd like to open a ticket with rimuhosting, but their website is down too. So can you take care of these hosting company servers first so we can deal with them directly? This is out of control and damage control on our end is just piling up to a nightmare.

Posted by icemaax, 08-10-2011, 10:05 PM
Power Systems
• Liebert and Powerware UPS with 120, 208, 220 VAC available
• Maintenance by-pass
• 2400 amp DC plant
• 100% diesel generator backup with auto start and automatic transfer switch
• Generator capacity equal to building utility service
• A and B DC plants totally diverse
• Fuel capacity exceeds 24 hours
• Fuel delivery contract with 2-hour guarantee
False information! Fail! So much for redundant power....

Posted by LizQ, 08-10-2011, 10:21 PM
Heya, I'm a sysadmin at RimuHosting. We are currently waiting for host servers to be powered up or come online. Colo4 is responding very slowly; however, our website is up (though slow), as are our live chat, Facebook and Twitter.

Posted by pvanmeter, 08-10-2011, 10:40 PM
Here is the final update for the evening: To close the loop for tonight, the migration from generator to utility power for the UPSs is complete. All equipment and connectivity is functioning properly. If you experience any issues, please open a help ticket. We have extra staff on-site this evening and are walking the data center to ensure all equipment is online. Thank you. PS. I know there are still cases of customer equipment issues and we have extra staff on-site and are working tickets to help bring them online. Thank you for your patience.

Posted by LizQ, 08-10-2011, 10:42 PM
For clarification, Whilst colo has all their gear up and working and connectivity functioning, their customers are all still frantically trying to get servers booted and online. We need to make sure routers, switches, and other gear is online, not to mention running through RAID checks, fscks and similar. In our case we have a lot of servers to get online, so this may take some time.

Posted by Visbits, 08-10-2011, 10:49 PM
Ugh, you're ignorant. They had only 1 AC power system fail; their routing and other devices did not go down. Only CHEAP customers who do not deploy dual power to their devices went down. AKA KH! Don't speak when you're clueless.

Posted by Formas, 08-10-2011, 10:53 PM
What happened to the update notices that were at this URL: http://www.colo4.com/ ? The Colo4 website is back now, but I need the URL for the update history. Does someone have the URL? Thanks

Posted by Hoopla-Brad, 08-10-2011, 10:54 PM
https://accounts.colo4.com/status/

Posted by cartika-andrew, 08-10-2011, 10:57 PM
Easy there.... Not to be difficult, but please do not make assumptions here. "CHEAP" customers were not the only ones impacted. Some A+B customers did indeed go down. I don't blame anyone here - this was just a bad situation, and frankly, this sort of thing can and will happen...

Posted by Formas, 08-10-2011, 11:03 PM
Thanks a lot

Posted by U2796, 08-10-2011, 11:06 PM
We DO have A/B power and we were down for about an hour.

Posted by ItsChrisG, 08-10-2011, 11:09 PM
And I don't fully believe that you didn't blow the breaker on your B circuit. If you are using power on both circuits it's NOT A+B, it's A+A. Your B circuit should have ZERO load so that if A goes down, B has the capacity to handle the full switchover without blowing its breaker. Per the status updates, the people who had A+B and went down did so because they blew their B-side circuits! Data center outages happen; there's no escaping them, I don't care where you run away to with tears in your eyes. You can't claim ignorance as the excuse for your complaining. You should have redundant power, you should have redundant data center locations; otherwise you just don't care enough about your website or business and have no reason to cry as much as I see some people in here doing.

Posted by U2796, 08-10-2011, 11:24 PM
I agree with your sentiment - no QQ -- and complaining rarely helps in any situation. However, that being said, we did not blow B side circuits. Per the status update on the COLO4 website: 8. Colo4 does have A/B power for our routing gear. We identified one switch that was connected to A only which was a mistake. It was quickly corrected earlier today but did affect service for a few customers. ----- In our case, we were one of the few ... it was a switch. We have been with Colo4 for 6 years and have in that time experienced very little downtime. They have been great with customer service and we are still very satisfied. That being said, we WILL be adding an additional Colo in ATL for true redundancy.

Posted by media r, 08-10-2011, 11:33 PM
I just wanted to say thank you to Paul and to the other people who replied to my post in this topic. Glad this is behind us.

Posted by cognecy, 08-11-2011, 01:06 AM
Paul, first off, as others have said before me, thank you for staying on top of this issue and getting it resolved. I do have a request, however, regarding the replacement of the ATS with the new unit in the weeks to come. I only ask that you find a way to install the new unit with zero downtime. I know I am only using one customer as an example, but this customer sticks out in my mind as he manufactures and sells PCs to the tune of $35-45K per day. The outage today obviously cost him a lot of money and definitely hurt his business. I have many customers like this one that sell around the clock, where a few hours' worth of downtime can literally cost them thousands of dollars. I know there are many hosts on this board that can likely attest to the same scenario with their own customer base. Keeping these growing customers is vitally important to our hosting businesses, as it is their growth and success that spawns our own. I fear another outage in a few weeks' time will only drive the nail into the coffin for many of them... even if it is during off-peak hours. If there is a way to avoid downtime, please seek it out.

Posted by layer0, 08-11-2011, 01:35 AM
For the customers that you're referring to above, why not seek some sort of geographic redundancy for their setup? People keep coming here and saying they are losing $x,xxx or even more by being offline - if that's the case, you need to invest more into your infrastructure and not rely on any single data center. Edit: Just to clarify, I'm in no way trying to say you shouldn't expect Colo4 to thoroughly investigate this and do whatever they need to prevent it from happening again, but I personally think this type of event could theoretically occur at almost any data center. It may not be as long of an outage, or it may be even longer. If it costs this much to your business, you need to be prepared. Last edited by layer0; 08-11-2011 at 01:39 AM.

Posted by Shazan, 08-11-2011, 01:38 AM
I have two of my servers down, as it seems the filesystem is damaged because of the outage. I've opened a ticket for their remote hands, updated it more than 2 hours ago, and I am still waiting for a reply. Are there still that many servers down to keep them this busy?

Posted by MtnDude, 08-11-2011, 02:16 AM
This is a question for Colo4: Let's say you have a measurable loss of revenue of $X due to the outage. However, the outage is still within the QOS of 16 hr/3 mo downtime, so no compensation is to be expected. If I want to write off this measurable loss on my taxes, can Colo4 provide me a letter confirming the time/date of the outage?

Posted by iwod, 08-11-2011, 03:02 AM
I am wondering which hosting services actually use them and were affected? The two I know of so far are KH and Dediserve.

Posted by wap35, 08-11-2011, 03:20 AM
I am very curious as well...

Posted by boskone, 08-11-2011, 03:27 AM
You can read all about the issue here https://accounts.colo4.com/status/

Posted by iwod, 08-11-2011, 08:01 AM
Is everything working for everyone now? Or am I the only one experiencing problems again?

Posted by MyNameIsKevin, 08-11-2011, 08:21 AM
Looks like I may be having some issues this morning. Just started actually

Posted by dediserve, 08-11-2011, 08:24 AM
We are seeing some network routing issues over Level 3 to Colo4 - techs are investigating. Send a traceroute to your provider if you are seeing an issue.
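
If you're not sure what to send, something along these lines (run from the machine where you see the problem, with the placeholder 203.0.113.25 replaced by your server's IP or hostname) is usually what a provider wants to see:
$ traceroute -n 203.0.113.25
$ mtr --report --report-cycles 20 203.0.113.25
The -n flag skips reverse DNS so the trace completes even when resolvers are flaky; the mtr report, if you have mtr installed, samples each hop 20 times and shows per-hop loss, which makes an intermittent routing problem much easier for the NOC to pin down.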

Posted by ASIMichael, 08-11-2011, 10:32 AM
Ahh... the entire center did not go down. I was there. It seems you were not, so don't say things. I hate when IT managers or other corporate managers jump to conclusions without knowing the facts. I was able to drive down. Things were calm and damn cold due to the lack of servers creating heat. Lights were on, and to my surprise a recently installed 2nd power bar was on the B electrical side, so all I had to do was split my servers (already redundant power supplies) onto the B power strip. HOW MANY OF YOU HAVE SERVERS WITH DUAL POWER? BTW, have you all considered that the ethernet switches and routers in your racks most likely do not have DUAL POWER? Be careful who you bite, as your own ass might be next. I also overheard another client saying he was out, moving his servers north, perhaps to Oklahoma. I said: wow, the weather would concern me the most.

Posted by Patrick, 08-11-2011, 10:45 AM
Haha. The people who are (foolishly) so quick to move over such a rare occurrence are the same people who will move when their next data center goes down, and then they'll move a third time and so forth. Instead of setting up a redundant website / failover operation and accepting that EVEN THE BEST data centers still experience downtime, they will keep moving and keep whining... pathetic really.

Posted by relichost, 08-11-2011, 11:44 AM
Everyone suffers downtime, but there's no excuse for those who whinge about losing thousands of dollars when their site goes down; if they are making thousands of dollars then they should have a backup procedure in place. I cannot stress this enough, people: make backups.

Posted by teh_lorax, 08-11-2011, 01:02 PM
Guys, stop being logical. Come on...

Posted by tmax100, 08-11-2011, 01:04 PM
I hear some people saying "don't just whine about it, invest in infrastructure" (something like that). Can somebody tell me more about this (or point me to a web article, etc.)? How can I prepare so I can work around it the next time a data center is down? What kinds of services are available for such an incident? Thank you

Posted by layer0, 08-11-2011, 01:14 PM
It depends on your specific requirements for redundancy, and of course your budget. Proper geographic redundancy can cost a fortune, which is why most people choose to forgo it and absorb the cost/loss from a rare issue like this. If you just need redundancy for some aspects of your service, like email and DNS, you may want to look at using www.dnsmadeeasy.com for your domains and an external managed email service like www.imap.cc (which is rock solid).
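
As a quick way to sanity-check the DNS side of that advice, the commands below (with example.com standing in for your own domain) show where your nameservers actually live; if every one of them resolves to addresses in the same facility or network, an outage like today's takes your DNS down with it.
$ dig +short NS example.com
$ dig +short ns1.example.com ns2.example.com
The second command does plain A lookups on whichever nameserver hostnames the first command returned; the goal is simply to confirm the resulting IPs sit in at least two different networks or data centers.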

Posted by sirius, 08-11-2011, 01:21 PM
Services have been restored. If you are still having issues, please contact your provider. Sirius


