

Rapidswitch down? [ Merged ]




Posted by Dan-CKS, 10-18-2010, 04:20 AM
Is rapidswitch down for you lot?

Posted by Mattfc, 10-18-2010, 04:25 AM
Was down for a few minutes for me. Back up for me now. No power out. Looks like a routing / connection issue of some kind.

Posted by Rob-Rackedeu, 10-18-2010, 04:32 AM
Well, from initial checks on our systems we believe there was a power outage, as our Linux boxes are now asking to run fsck.

Posted by CretaForce, 10-18-2010, 04:35 AM
4 servers are back online and didn't lose power. Currently I have 1 server down. Does anyone else have a server down?

Posted by Dan-CKS, 10-18-2010, 04:35 AM
Yep, my systems are still down.

Posted by Rob-Rackedeu, 10-18-2010, 04:38 AM
Just completed the initial fscks and now nothing is responding again, including myservers.rapidswitch

Posted by Dan-CKS, 10-18-2010, 04:39 AM
Strange how their website is still online though, isn't it... I might give them a call shortly.

Posted by gigatux, 10-18-2010, 04:41 AM
It's down here for all of our servers bar one. Our pings to one of our main VPS nodes have given 10 responses out of 283 attempts.

Posted by Rob-Rackedeu, 10-18-2010, 04:44 AM
I've just called and been told there was no power outage but a problem with connectivity to peers. Not really sure why 2 separate boxes were asking for fscks via VNC then. But apparently it's being investigated now.

Posted by CretaForce, 10-18-2010, 04:48 AM
Can you tell me their number?

Posted by Dan-CKS, 10-18-2010, 04:50 AM
For you it will be +44 20 7106 0730. For UK people it will be 0808 238 0033.

Posted by CretaForce, 10-18-2010, 04:56 AM
I don't believe it's related to peers. I can't ping the server which is down using other servers inside their network.

Posted by Jon-RackSRV, 10-18-2010, 04:59 AM
That, and LINX for example is not showing any peering issues, so it seems to be something specific to them: https://www.linx.net/pubtools/trafficstats.html

Posted by rotame, 10-18-2010, 05:04 AM
One server is down for me as well.

Posted by Rob-Rackedeu, 10-18-2010, 05:04 AM
Don't shoot the messenger! That aside, I can connect to some of my boxes no problem. Others I cannot, nor can they ping out to external sites/IPs (restricting some services). So who knows!

Posted by CretaForce, 10-18-2010, 05:12 AM
On the servers which are UP, their portal shows "Gb1 (RapidSwitch Network)", and on the server which is DOWN it shows "OB1 Not Connected". Do you see the same? Also I ran a ping from another server on the same network to the server which is down and got 1 packet reply (1500 packets sent - 1 received).

Posted by CretaForce, 10-18-2010, 05:14 AM
And some more info: the servers which are UP are on F5, K9, B6 and E2 in their datacenter, and the server which is DOWN is on B7.

Posted by Rob-Rackedeu, 10-18-2010, 05:15 AM
Latest Message From RS.

Posted by Russ Foster, 10-18-2010, 05:21 AM
http://servicestatus.rapidswitch.com/

Posted by Mattfc, 10-18-2010, 05:33 AM
My server is up, although RapidSwitch monitoring is telling me it's down. Monitoring does not appear to be working.

Posted by CretaForce, 10-18-2010, 06:04 AM
Sometimes the server which is still down replies to ping:
64 bytes from 78.129.226.xx: icmp_seq=3 ttl=63 time=0.186 ms
64 bytes from 78.129.226.xx: icmp_seq=177 ttl=63 time=0.498 ms
64 bytes from 78.129.226.xx: icmp_seq=189 ttl=63 time=0.355 ms
64 bytes from 78.129.226.xx: icmp_seq=254 ttl=63 time=0.347 ms
64 bytes from 78.129.226.xx: icmp_seq=258 ttl=63 time=0.516 ms
64 bytes from 78.129.226.xx: icmp_seq=266 ttl=63 time=0.346 ms

Posted by CretaForce, 10-18-2010, 06:06 AM
11:01 BST: The issue affecting RSH North is ongoing. A proportion of the traffic into RSH North will currently be experiencing increased latency and packet loss at varying levels. In some instances this may appear to be complete connectivity loss. We are investigating the cause and will provide further updates as soon as we possibly can.

Posted by GeckoCP, 10-18-2010, 06:12 AM
Completely down for 2 VPSes from my own connection and from our monitoring stations. RapidSwitch's own monitoring also shows both are down as well.

Posted by CretaForce, 10-18-2010, 06:31 AM
Can you ping the servers which are down from another server inside the RapidSwitch datacenter?

Posted by gigatux, 10-18-2010, 06:38 AM
No - we certainly cannot do that. We have one server that seems to have no connectivity issues, and two that are unreachable. This is the ping from the working server to the two that cannot be properly reached:

Posted by blueskimonkey, 10-18-2010, 06:42 AM
Just over 2 hours of downtime so far. Has anybody read the terms and conditions for the SLA yet? They mention 100% uptime, with exceptions for network issues outside of their control, however this issue seems to be in-house? https://myservers.rapidswitch.com/Terms.aspx

Posted by CretaForce, 10-18-2010, 06:44 AM
I see the same result for your server:
[chris@server17 ~]$ ping 109.169.0.2
PING 109.169.0.2 (109.169.0.2): 56 data bytes
64 bytes from 109.169.0.2: icmp_seq=13 ttl=63 time=0.665 ms
64 bytes from 109.169.0.2: icmp_seq=29 ttl=63 time=0.344 ms
64 bytes from 109.169.0.2: icmp_seq=113 ttl=63 time=0.654 ms
64 bytes from 109.169.0.2: icmp_seq=201 ttl=63 time=0.653 ms

Posted by Rob-Rackedeu, 10-18-2010, 06:52 AM
I think, from reading all the reports on Twitter and here, that this is clearly not an external fault, although it will be blamed on one to void the 100% SLA. (This has happened before.) I have servers currently working fine with no issues at all in one rack, but others responding with no HTTP connectivity in another.

Posted by zione, 10-18-2010, 06:54 AM
It's 4 hours now... we have 4 servers in RSH North... 3 of them became reachable after a while (but with packet loss and latency), one is not... Does any RS client here know what's really happening with this company lately?

Posted by GeckoCP, 10-18-2010, 07:22 AM
The latest information I have as of 11:44 is that they have sent out a mass e-mail (which I haven't received yet, 12:20); however they don't currently have any more information, other than there being an issue at RSH North, and they don't know how long it will take to fix.

Posted by aeris, 10-18-2010, 07:29 AM
11 servers, of which 3 are unresponsive - that is, having 99% packet loss. And it's definitely internal, since servers on their own network cannot reach them either. Of those three, two are in North Upr while one is in North Lwr.

Posted by moozaad, 10-18-2010, 07:33 AM
Yep, still ongoing. 2 boxes unreachable. Traceroute ends at the VM hand-off. I bet it's not covered by the SLA, they'll blame it on LINX or someone else... I thought they were supposed to have redundant routes!?!

Posted by Nora A, 10-18-2010, 07:40 AM
I don't know why RapidSwitch servers go down every few days; they don't have a good service. There's always a problem in the network. As I remember, about 2 weeks ago they went down and all clients' servers were offline for 8 hours. Can anyone advise a good UK company, please?

Posted by CretaForce, 10-18-2010, 07:50 AM
12:30 BST: The initial problem with a peering point that occurred earlier this morning has led to a problem within the VSS routing cluster that services RSH North. Our network team quickly eliminated as many causes as possible. This issue was then escalated to Cisco and we are currently involved in joint investigation to try and discern the underlying problem. As soon as we have any progress from this work we will inform you immediately.

Posted by jarimh1984, 10-18-2010, 08:12 AM
I have contacted Racksrv & Poundhost already, too many problems lately with Rapidswitch.

Posted by Nora A, 10-18-2010, 08:14 AM
We're losing clients and money with RapidSwitch.

Posted by CretaForce, 10-18-2010, 08:19 AM
Already 4 hours of downtime

Posted by ca-uk, 10-18-2010, 08:22 AM
Yes - it's a pretty poor performance. RapidSwitch seem to be consistently the worst performer among the various data centres we use - I know they're at the budget end of the scale, but 4h is really unacceptable. To offer a 100% SLA you would expect a certain level of redundancy - up to 1h I could almost swallow, but over 4h is really pushing the realms of reason...

Posted by rotame, 10-18-2010, 08:22 AM
I think it's the worst downtime I've had with them.

Posted by ca-uk, 10-18-2010, 08:24 AM
We had almost a day's downtime about 2 years ago (November 08) - they promised to placate people by offering free memory upgrades, etc, which they then never honoured. As other people have alluded to, I suspect they won't honour any attempt at claiming reimbursement for the downtime.

Posted by rotame, 10-18-2010, 08:28 AM
Yes, you're right, now I remember it.

Posted by CretaForce, 10-18-2010, 08:28 AM
Now all servers are down.

Posted by Rob-Rackedeu, 10-18-2010, 08:30 AM
Yes, everything is down now. I can only hope that it is to reboot the whole cluster or a final fix is in progress.

Posted by cobra3000, 10-18-2010, 08:31 AM
Same issue

Posted by LP560, 10-18-2010, 08:32 AM
I have two colocated machines with RapidSwitch which are both offline (I just moved over to them) - not the best move I've made! Not too pleased with their recent antics; I, like many others here, will be looking for some sort of compensation.

Posted by dropby23, 10-18-2010, 08:33 AM
Yes, the main RapidSwitch website is down as well.

Posted by JonQuay, 10-18-2010, 08:39 AM
Yep. Had one box down all day, now the other is down as well. And the RS website. Fortunately for us it's mainly backups / small sites etc hosted there. There was a time I considered moving our main operations over; so glad I didn't. Useless.

Posted by Harzem, 10-18-2010, 08:39 AM
I have a VPS with a company located in Rapidswitch, and it is down too.

Posted by oooooo, 10-18-2010, 08:39 AM
I don't care too much about compensation - but I do care very much that our website is down. No compensation that they are likely to pay will make up for this... so my question is: who should we co-locate our servers with in future? (We currently serve around 10 - 15 million page impressions per month and have a turnover of over £15 million per year. We expect and need MUCH MUCH better service than this. I will certainly be pushing our tech team to investigate moving.) Is it just me - or have RapidSwitch been getting worse and worse lately? Since they were bought out, they seem to be constantly planning outages - not to mention occasions like this!!! Absolutely diabolical.

Posted by CretaForce, 10-18-2010, 08:40 AM
Now all servers are back online.

Posted by JonQuay, 10-18-2010, 08:42 AM
Back up here too.

Posted by moozaad, 10-18-2010, 08:43 AM
Is there a website that shows average uptime between outages and uptimes of the various DCs and providers? I'd really like to 'go compare'. Over the last 12 months I think my average uptime is 50-60 days between incidents. Total downtime is approx 24-30 hours. PS. Looks like it's all back up. (They turned it off and on again!)

Posted by ca-uk, 10-18-2010, 08:45 AM
Yep - back online here too. oooooo - I think you get what you pay for in some respects. We've always used RapidSwitch as the budget option, and in the 4+ years we've been with them, they've always had more network issues than the other hosts we use. Our other servers are with Dedipower and Rackspace, and nowadays we do a lot with Amazon Web Services - of the three I'd investigate the AWS option, as you can design your application for failure and take advantage of multiple availability zones. While annoying, we also have a responsibility to build redundancy into our systems - some people don't want to pay for it, and as such downtime is a frustrating fact of life.

Posted by davidman, 10-18-2010, 08:50 AM
To be fair, if your turnover is so large, having redundancy built into your systems ought to be a priority, but I completely understand your frustration. We're getting rather annoyed at all the downtime too. My impression was that last year's outage regarding the router would prevent routing problems in the future. Now it seems that is not the case.

Posted by XTremo, 10-18-2010, 09:00 AM
Still waiting here!

Posted by JSHosts, 10-18-2010, 09:00 AM
Our servers with RapidSwitch appear to be back to normal now, but the downtime lately is just ridiculous. I would love to move to another datacentre but don't want to cause our clients even more downtime. We recently moved our shared/reseller services to a Newcastle datacentre and uptime has been great so far.

Posted by Amitz, 10-18-2010, 09:08 AM
Unfortunately, my VPS with FutureHosting is still down. Damn. I really love FutureHosting and their service, but RapidSwitch as their datacenter sucks a bit lately...

Posted by oooooo, 10-18-2010, 09:08 AM
Hi ca-uk and davidman, thanks for your replies. I'm only semi-technical myself, but will be passing opinions on. We reviewed Rackspace when we moved to RapidSwitch - but as I remember, they didn't offer co-located servers. I assume that is still the case? As I remember, when we last reviewed our systems, we were mainly concerned with ensuring we had our data covered against all extreme eventualities. So we have our data and our code encrypted and stored in a couple of other locations. We believed the 100% uptime claim of RapidSwitch - but felt that even in an exceptional circumstance, we could probably get our code and data live with another host within approximately a day. With hindsight, I guess we just didn't protect against situations like these. Being not fully up on these things - how do you build in redundancy when you need up-to-the-minute account data for people? Do you just have to keep on copying the database to multiple locations (sounds horrible)? Also, I assume we are talking about a system of automatically switching to spare live servers as necessary (load balancing I guess) - rather than changing DNS settings to send traffic to a new web-host location?

Posted by Ed-Freethought, 10-18-2010, 09:09 AM
There are several reputable companies based in Maidenhead at the carrier-neutral BlueSquare campus of 3 facilities, which is where RapidSwitch used to be before they built Spectrum House (there are still photos of BlueSquare 2 on the RapidSwitch web site) and got bought out by iomart. It's probably worth checking out your options there, as they are likely to provide the quickest migration option for your co-located server due to the proximity (it's less than three miles between Spectrum House and BlueSquare House).

Posted by aeris, 10-18-2010, 09:09 AM
All servers appear to be back online, but they all lost connectivity for a brief period first. They did honor free upgrades or a free month of service, but you had to request it. The only reason I'm still with them is that these are a bunch of mostly dirt cheap servers from before the price increases, that can push 10 TB each.

Posted by Ed-Freethought, 10-18-2010, 09:12 AM
Generally, doing high availability without support built into the applications is never going to be perfect, but there are all sorts of ways of gracefully handling such situations through clustering and/or virtualisation if you know what you are doing. These often don't lend themselves well to the budget end of the market, however, due to needing more specialised and expensive hardware.

Posted by Harzem, 10-18-2010, 09:20 AM
Exact same here, but it's a backup VPS for me so I'm not overly worried yet.

Posted by ca-uk, 10-18-2010, 09:29 AM
lol - they obviously like you more than us then - we just got false promises and were sent round in circles with "soon..." but it never happened, so we gave up!

Posted by ca-uk, 10-18-2010, 09:30 AM
There are many, many ways of building in redundancy - so we don't hijack the thread, send me a PM if you like with more details of what you do and I can send you a few thoughts.

Posted by nikg, 10-18-2010, 09:49 AM
Still down. At least they could post some status updates on their site. Looks like it's time to start looking for a new DC...

Posted by GeckoCP, 10-18-2010, 10:03 AM
It's up for us now, however we're still having a few problems with alerts from the monitoring servers every now and then.

Posted by XTremo, 10-18-2010, 10:04 AM
I know Vik at FutureHosting wasn't best pleased with the maintenance outage at RS last month! Hopefully FH will take a good look at Bluesquare because they're going to end up losing customers through no fault of their own, simply because of their RapidSwitch connection.

Posted by Amitz, 10-18-2010, 10:24 AM
You are right - if I weren't so taken with their services, I would consider switching hosts. It's a pity for them (FH) that these issues are out of their control. Anyway, my VPS is still down and I am not too happy about this...

Posted by dufu, 10-18-2010, 10:34 AM
Still down for me too, this is taking waaaay too long.

Posted by JayX, 10-18-2010, 10:50 AM
Yeah, I read the same mailout you guys are referring to and Vik is definitely not too pleased as it stood anyway. A lot of people are going to take this out on FH when it's blatantly not their fault that there's nothing but trouble in the London datacentre. Did we used to be hosted at BlueSquare? I could've sworn that was the UK HQ when I first signed up, not Spectrum House.

Posted by Ed-Freethought, 10-18-2010, 10:52 AM
RapidSwitch used to have an entire floor in BlueSquare 2 (but still claim it as their own datacentre), but they moved to Spectrum House around May 2009.

Posted by XTremo, 10-18-2010, 11:03 AM
And it couldn't have come at a worse time of the day! I've got around 100 UK business clients on there and my phone hasn't stopped ringing because of it. I've explained the situation to them but as you know the average end user just sees unreliability and unprofessionalism. Especially when they're in contact with other businesses on my books who are located on my HostDime VPS in Bluesquare where it's just business as usual. I love FH....but if this goes on much longer (and it has to be around 3 hours now) I'm going to be put in a position I don't want to be in, due to client complaints caused by RapidSwitch.

Posted by JayX, 10-18-2010, 11:16 AM
Yeah, the timing is a nightmare this time around. Any problems beyond a few minutes I've had tend to be around midnight-2am (I think a lot of people batch processes then, and the VPS hits ridiculous load while my pages are quiet) so it's not the biggest issue. Within a few minutes of it going down today, I had my mum on the phone asking where her email was. Who needs Pingdom when you have an addicted parent?

Posted by LP560, 10-18-2010, 11:19 AM
I sent them an email a few hours ago, apparently they still haven't found the root cause nor have they had the all clear as of 16:00. They will be sending a refund though, but looks like you will have to ask and provide proof!

Posted by mrzippy, 10-18-2010, 11:28 AM
Open a new thread in the forums here to ask and you'll get plenty of advice. Short answer - if your website is critical then you should invest in a failover system. Every datacenter on the planet is going to have unexpected problems, so prepare for them.

Posted by SlAiD, 10-18-2010, 11:38 AM
RapidSwitch is still down. Does anyone "inside it" know an ETA? My host doesn't help much on it. They probably know the same as I do. SL

Posted by aeris, 10-18-2010, 11:46 AM
All hosts go down some time or another. Even the best ones occasionally experience multiple simultaneous points of failure or, in rare cases, random explosions. So if you want ~100% uptime, you will have to host at several different locations and use some form of fail-over. Then again, RapidSwitch seems to have had more than its share of random downtime over the past couple of years, which is unfortunate. Hmm. Well, we had 11 servers with them back then (down to 9 now), but I don't really consider that "a lot". Either way, I ended up requesting a bandwidth upgrade on all servers instead of something that would cost them real money, and this was applied within minutes. Over time, I've saved a bundle on that.

Posted by XTremo, 10-18-2010, 11:50 AM
Vik, the FutureHosting CEO, has contacted me (and probably the other clients) personally... as he's a very hands-on guy! And they're still waiting for further updates from RapidSwitch! I have no doubt whatsoever that something will be done about this!

Posted by JayX, 10-18-2010, 11:53 AM
What did Vik's mail say? I've not heard anything because (stupidly) my contact email on their system is the one hosted at RapidSwitch!

Posted by SlAiD, 10-18-2010, 11:59 AM
Sites are back online! SL

Posted by SlAiD, 10-18-2010, 12:00 PM
Same here... Could you paste the e-mail here? SL

Posted by XTremo, 10-18-2010, 12:01 PM
Just came back online.....cross your fingers!

Posted by Amitz, 10-18-2010, 12:02 PM
Haven't received the email either, even though my email is externally hosted. But my VPS is back online after 3h 39min of downtime!

Posted by XTremo, 10-18-2010, 12:13 PM
I've no doubt that even though Vik couldn't get through, you'll be hearing from him soon.

Posted by Amitz, 10-18-2010, 12:23 PM
<-- Snip --> Hello, The upstream Data Center has restored connectivity to our infrastructure and your VPS/server is back online. We understand this has been extraordinarily frustrating for you and us as well and we can only offer our sincerest apologies and assure you we are not standing by without action. We will post a RFO when we have received a full and detailed report. In addition, our *management team will have additional information posted related to our overall European operation soon. Once again we are very sorry for the inconveniences caused. Thank you, Bill Future Hosting 39555 Orchard Hill Place Novi, MI 48375 <--Snap-->

Posted by zione, 10-18-2010, 02:08 PM
This email follows up today's network issue with further information, to give you our latest understanding of the issue.

Date: 18/10/2010
Time: 09:20
Duration: <270 mins
Affected Service(s): IS-01366, IS-01611, IS-01612

The issue affected servers in the North side of our Maidenhead Data Centre, Spectrum House (RSH-North). Approximately 50% of servers were affected, so generally speaking 50% of the servers listed above would have been affected. It is not possible to say in retrospect which these were, because our monitoring servers were affected and hence they recorded an outage for all servers, including those at other data centres that were not affected at all. We have cleared this misleading monitoring data.

At ~08:55, we became aware of a network issue affecting some servers in the North side of our Maidenhead Data Centre (RSH-North). Approximately half of these servers were experiencing connectivity problems ranging from packet loss to total loss of connectivity. Other servers were unaffected by this issue, and were responding as normal. Our network monitoring server was amongst those fully affected by this problem and therefore reported a total outage, including for servers hosted at other data centres and not affected at all. We are in the process of clearing this misleading monitoring data.

The issue we detected was affecting both the primary and secondary Cisco 6500 network systems, which are configured in a VSS-1440 redundant cluster. We ran through our emergency procedures to identify the problems, but all tests were responding within normal parameters. After finishing our emergency procedures, and not identifying a specific problem, we raised a case with Cisco TAC at ~10:10. A Cisco engineer then logged into our routers to try and identify the problem. After 3 hours, the Cisco engineer was unable to provide a resolution; we understood the problem was either a software bug within the routers, or else a hardware fault.

We took matters into our own hands at ~13:20, and decided to reboot both routers. This affected all servers in the RSH-North data floor, as it takes about 15-20 minutes for the routers to reload. During the reload, the primary router failed to boot up normally. The secondary router booted normally, and our monitoring showed service was restored as a result of this. Our conclusion is that the failure of the primary Cisco 6500 to boot indicates a hardware problem.

We take full responsibility for all the infrastructure required to provide you with a reliable service, and therefore we asked Cisco to provide an answer to these questions:
1) Why were Cisco unable to diagnose a hardware fault within a 3 hour time frame?
2) Why did traffic not automatically fail over to the secondary 6500, as designed?

Cisco commented that they do not know for sure if this is a hardware problem, and so were unable to provide a specific response to these two questions. Clearly these are very important questions that need to be answered, and we will continue to work with Cisco to provide a full and adequate response to them.

Regards,
The RapidSwitch Team

Posted by zione, 10-18-2010, 02:11 PM
So if I understood correctly, we are relying on the backup router while the primary is still waiting for a Cisco tech or a replacement... So let's wait for a scheduled downtime when they will reboot both routers and do what they (maybe) should have done from the beginning: checking whether the second router takes over when the first one fails.

Posted by PCS-Chris, 10-18-2010, 02:16 PM
I personally know they have a third Cisco 6500 router separate from the VSS cluster, should there ever be an issue with the cluster. This was added following the previous major network outage. I would be interested to know why they didn't fall back to this router, at least as a stop-gap, while the cluster was having issues.

Posted by SlAiD, 10-18-2010, 03:24 PM
I'd be surprised *maybe not* if they told us it's NOT RS's fault... oh well, welcome to the hosting industry I guess. SL

Posted by moozaad, 10-18-2010, 03:31 PM
Their reply to my SLA claim was to say 'Show us your logs'... so I pointed out the 5 messages they sent me telling me my connectivity was broken. I doubt I'll get away with that though. It's handy that they turned off their monitoring during the crisis. We'd have never known they were down except that we use them all day every day. ಠ_ಠ It would have been nice to have been told about the outage by their monitoring systems without us having to start poking. Neither emails nor SMS arrived for that period.

Posted by tandyuk, 10-18-2010, 07:39 PM
Their service status page's last message says "As soon as we have any progress from this work we will inform you immediately." Still waiting for that update. I still can't believe it took 5 hours before they decided "Let's give it a reboot"; I thought everyone in the IT industry knew that's the first thing you do when a piece of hardware has a problem. I too have sent an SLA credit request.

Posted by Harzem, 10-18-2010, 07:43 PM
No, that's the first thing to do when a Windows device has a problem. Most *nix machines can be fixed (unless you are removing hardware) without rebooting.

Posted by CraigMesser, 10-18-2010, 08:05 PM
What's the % SLA credit they give, out of interest?

Posted by plumsauce, 10-18-2010, 08:11 PM
You need to be running one or more hot spares. The account data is kept up to date using the replication mechanisms built into the particular flavour of database that you are using. In considering the pros and cons of dealing with data that may be slightly stale, the decision has to be made whether you want to be able to serve most of your clients during an outage on the main server, or whether you would rather be completely down. The type of load balancing you are speaking of when looking at distributed high availability is global load balancing, and it *is* done with DNS. It's just that the DNS manipulation is automatically handled by the monitoring system of your specialist DNS service provider. There is no "getting the server up in 24 hours or so", because some clients on a "bad server day" can be automatically flipping between their data centers 20 or more times in a 24 hour period.
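As a rough illustration of the monitoring-driven DNS switching described above, here is a minimal sketch. The update_dns() hook, hostnames and IPs are placeholders rather than anything from this thread; a managed failover DNS service performs this loop for you with low TTLs, and the hot spare itself stays current via the database's own replication.

```python
#!/usr/bin/env python3
"""Minimal sketch of monitoring-driven DNS failover between two sites.
The update_dns() hook and all hostnames/IPs are hypothetical; in practice
a managed failover DNS service does this for you with low TTLs."""
import time
import urllib.request

PRIMARY_CHECK = "http://primary.example.com/health"   # main server (hypothetical)
STANDBY_CHECK = "http://standby.example.com/health"   # hot spare fed by DB replication
HOSTNAME = "www.example.com"
PRIMARY_IP, STANDBY_IP = "203.0.113.10", "203.0.113.20"

def is_up(url, timeout=5):
    """True if the health endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def update_dns(hostname, ip):
    """Hypothetical hook: point the A record at `ip` via your DNS provider's API."""
    print(f"[dns] {hostname} -> {ip}")

def main():
    serving = PRIMARY_IP
    while True:
        # Prefer the primary; fall back to the replicated hot spare when it is down.
        target = PRIMARY_IP if is_up(PRIMARY_CHECK) else STANDBY_IP
        if target != serving:
            update_dns(HOSTNAME, target)
            serving = target
        time.sleep(30)   # keep the DNS TTL short so the switch takes effect quickly

if __name__ == "__main__":
    main()
```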

Posted by tandyuk, 10-18-2010, 08:21 PM
0.5 days' credit per hour of downtime. Limited, of course, to the amount you pay for services in a monthly period, but it would take 72+ hours of downtime to get to that stage. I for one would have driven to Maidenhead, collected my servers, and put them in a different datacenter long before that though!

Posted by astutiumRob, 10-18-2010, 09:50 PM
It's only the first thing to do if you're a fool that thinks "The IT Crowd" is a documentary ...

Posted by Amitz, 10-19-2010, 02:26 AM
It isn't?? I thought they broadcast it live from the RapidSwitch facilities!

Posted by plumsauce, 10-19-2010, 02:47 AM
Maybe it's not the first thing to do, but it is something to consider reasonably quickly when under the gun. It is an especially reasonable thing to do if it is a resource availability problem. And, as it turns out, it would have been a faster diagnostic in the case at hand.

Posted by GeckoCP, 10-19-2010, 04:37 AM
Exactly! As an IT professional, rebooting something to fix it should be the last thing you do, otherwise you never get to the root of the problem and leave it possible to happen again. So effectively your reboot doesn't actually fix the problem, it just makes it go away.

Posted by Jon-RackSRV, 10-19-2010, 05:19 AM
It would have also caused complete downtime (rather than packet loss) to RS's entire client base, which according to their own blog post would have lasted 20+ minutes.

Posted by plumsauce, 10-19-2010, 06:03 AM
And in the end it still had to be done. Whatever the reboot time, and 20+ minutes is on the high side for that series, an earlier reboot would have saved some pain. The decision depends on the situation, the experience and the instincts of the particular admin.

Posted by blueskimonkey, 10-19-2010, 06:25 AM
Just had a response from Rapidswitch for my credit request, they are asking for logs to prove the downtime.

Posted by ca-uk, 10-19-2010, 06:48 AM
I have requested a credit for our servers and they have said they will process it - they didn't ask for any logs, but I did raise a ticket yesterday at 9:30ish to say we were completely inaccessible, so maybe that's enough. Now I'll just wait and see if it shows up.

Posted by Harzem, 10-19-2010, 07:09 AM
Great post. One can't complain about repeated downtime/problems if he thinks the first thing to do to fix a problem is just rebooting.

Posted by tandyuk, 10-19-2010, 07:32 AM
I wasn't suggesting that doing a reboot is what will fix it; however, given that these devices are DESIGNED to work as a redundant pair with automatic failover, it should be fully possible to reboot either of them at any time without any effect on the service provided... Isn't that the point of having a redundant pair? Or is my understanding of what "Automatic Failover" means wrong? Furthermore, I would never stop investigating an issue simply because a reboot 'fixed' it. The goal is to get services working again ASAP, and to continue investigating what caused the issues once this has been achieved. Do your servers delete their logs when you reboot? Mine certainly don't, you simply have to go further back through the logs to get to the relevant messages. Any device which does reset its logs upon a reboot should be configured to send those logs to a remote system which will save them, using syslog or something similar. In most SOHO environments, if a connection starts playing up, a router reset is one of the first things on the list. Virgin even put a sticker to that effect on their modems, requesting you reboot it before even bothering to call tech support. I'm just staggered that, given the level of problems that were being experienced yesterday, it took over 4 hours before this was even considered, and lo and behold, once done, the problem then became very apparent.
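On the remote-logging point above, a minimal sketch, assuming a hypothetical collector at logs.example.com listening on UDP 514, of shipping messages off-box with Python's standard SysLogHandler so they survive a reboot of the host that produced them; actual network gear would use its own equivalent configuration rather than Python.

```python
#!/usr/bin/env python3
"""Minimal sketch: ship log messages to a remote syslog collector so they
survive a reboot of the box that produced them. The collector address is
hypothetical; network devices would use their own equivalent configuration."""
import logging
import logging.handlers

def make_remote_logger(collector="logs.example.com", port=514):
    """Build a logger that forwards every record to the remote collector over UDP."""
    logger = logging.getLogger("outage-demo")
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=(collector, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

if __name__ == "__main__":
    log = make_remote_logger()
    # This record lands on the remote collector, not just on the local disk.
    log.warning("uplink flapping: packet loss towards the gateway observed")
```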

Posted by tandyuk, 10-19-2010, 07:36 AM
I raised a ticket with them stating the times Pingdom reported services were unavailable (9:17 - 13:43). They haven't asked me for anything else.

Posted by nikg, 10-19-2010, 02:25 PM
It's true that DCs can have incidents from time to time. But this one showed that the RapidSwitch infrastructure is a joke. What is the point of a redundant router if it can't be used in such cases? Was the configuration tested for a variety of scenarios, or were they expecting one to happen in order to find out whether all traffic would be routed through the backup router? Monitoring systems giving false warnings? How about a redundant monitoring system in case one goes down or goes mad? Communication with clients or updates during the incident? They don't bother. A hardware fault on both routers at the same time???? I don't think so. Blaming someone else (in this case Cisco) is not a good answer for me. And this brings us back to my previous question: was the router/cluster configuration tested before deploying it? Having a 100% failure and a whole DC going down is unacceptable. And the best part is that we should prepare ourselves for another downtime soon while they try to fix the problem. Since our clients don't know RS, only us who offer them hosting, they are going to blame us again, and the possibility of losing some next time is high. So our management team has already started looking for other alternatives to move our servers to.
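On the redundant-monitoring point, a minimal sketch of an independent probe that could run from a location outside the affected datacentre, so alerting does not depend on the provider's own (affected) monitoring; the target hostnames and the alert hook are hypothetical.

```python
#!/usr/bin/env python3
"""Minimal sketch of an independent uptime probe, meant to run from a host
outside the affected datacentre. Target hostnames and the alert hook are
hypothetical."""
import socket
import time

TARGETS = [
    ("server1.example.com", 80),   # web
    ("server2.example.com", 22),   # SSH
]

def tcp_check(host, port, timeout=5):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def alert(host, port):
    """Hypothetical hook: send an email/SMS from this external location."""
    print(f"ALERT: {host}:{port} unreachable")

if __name__ == "__main__":
    while True:
        for host, port in TARGETS:
            if not tcp_check(host, port):
                alert(host, port)
        time.sleep(60)   # probe once a minute, independently of the provider
```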

Posted by tandyuk, 10-19-2010, 03:22 PM
Not to mention that currently, as one router of the 'redundant cluster' has a hardware fault, the entire DC is likely running on the remaining good router. I can't see them having £50k of spare equipment on site, so until it is repaired, what happens if the second one has a similar hardware fault?

Posted by ca-uk, 10-19-2010, 08:38 PM
kabooooommm

Posted by Steven, 10-20-2010, 02:49 AM
Anyone who is still using RapidSwitch is a fool. Do you honestly think they will get better before you lose all your customers?

Posted by Dan-CKS, 10-20-2010, 04:13 AM
Haha, whatever. Not like your ISP is any better.

Posted by astutiumRob, 10-20-2010, 07:40 AM
Every server/service/supplier/software/system has an issue at some point - if the level/frequency is outside your acceptable level for the price you are paying, then move... Of course they'll blame you - it was *your* choice of supplier that caused the client to experience an outage - they pay you for a service at a set/expected level - who else do you think they should blame/call/whinge at?

Posted by VMhosts, 10-20-2010, 08:11 AM
As customers of RapidSwitch, the recent outage has caused us problems. I don't think our customers blame us for our choice, but we are certainly their point of contact. Unless our customers feel that we should have a crystal ball to predict such failures. As with any problems we face, openness goes a long way with customer loyalty. Our theory up to this point has been that the problems seen at RapidSwitch would be learnt from, and that moving providers could mean similar outages which hadn't yet been seen and learnt from. Unfortunately this is the second time RapidSwitch have faced such problems with the same Cisco cluster. I would think all of their customers have some difficult (or simple, depending on their infrastructure) decisions to make about moving their physical equipment. I will add, though, that the quality of their engineering response and network performance has always been great for the % of time that it is up.

Posted by CraigMesser, 10-21-2010, 06:46 PM
To be honest, not meaning to speak ill of them but Rapidswitch being a more budget host isn't relevant. The fact is they offer 100% uptime and therefore should be able to cover it. Unfortunately they seem to be pulling the short straw a lot lately.

Posted by Dan-CKS, 10-21-2010, 10:02 PM
Budget host? I think not. In fact your servers are cheaper; OVH is a budget host...

Posted by LP560, 10-22-2010, 09:05 AM
I've just been refunded 4 days' service, not much but better than nothing! Still no word from them regarding the Cisco cluster situation; they will probably wing it on one router for as long as possible!

Posted by tandyuk, 10-22-2010, 09:20 AM
That's what worries me!

Posted by Rick Lee, 10-22-2010, 11:46 AM
I don't see why you guys are slagging RapidSwitch off; I've seen nothing but good remarks and a lot of people use them. You're all just slagging RapidSwitch off. I've got to hand it to them, the network is mainly stable and fast. It's all online for me?

Posted by zione, 10-23-2010, 04:42 AM
Steven, have you got any suggestions from your experience? Who do you recommend in the EU?

Posted by LP560, 10-28-2010, 06:59 AM
We will be carrying out maintenance which is applicable to some of your services with us.

Maintenance Type: Network Maintenance
Expected effect on your service: No Effect
Expected downtime duration: 0 minutes

This will occur between 16:30 and 17:00 on 29 Oct 2010 (UK Time).

This maintenance is required to implement changes recommended by Cisco that reduce redundancy temporarily. This should not have any effect on the service provided by RapidSwitch. We apologise for any inconvenience this may cause, please do not hesitate to contact us if you have any queries or questions regarding this maintenance window.

Regards,
The RapidSwitch Team

..... Would have been nice to know what they are actually doing, i.e. replacing the router, removing it, or just messing around - I suspect the latter.

Posted by VMhosts, 10-28-2010, 07:00 AM
I called them about 30min before I got this notice and was told there was no update.... odd

Posted by aeris, 11-24-2010, 12:35 PM
Seriously?

Posted by CretaForce, 11-24-2010, 12:48 PM
I noticed a 1-2 minute outage (Pingdom). Nothing else.

Posted by gigatux, 11-24-2010, 12:51 PM
A bit longer here (15-20 mins according to Monitis), affecting 2 of our 3 servers. However, the affected sites worked from my home connection, so it wasn't all routes that were affected.

Posted by zione, 11-24-2010, 12:58 PM
Our BW usage dropped from 10 MB to nothing all of a sudden and then slowly ramped back to normal. There is a deep spike on the BW graph... can't wait to move every server we have at RS elsewhere.


