Large Outage in San Francisco

We’re currently experiencing a rather large outage in San Francisco, presumably caused by a power failure to some of our colocation space there. All available resources are being brought to bear, and we’ll have more shortly. This outage is likely impacting a large portion of our DSL customers as well as some Dial, Biz-T, FRATM, and other services. -Nathan, Jared, and the rest of the NOC

Update 8:25AM: We have confirmed that the issue is caused by a power failure. Apparently the facility at 200 Paul Avenue is currently without utility power. The vast majority of the site’s generators, ATSes, and UPS systems worked properly, but the UPSes feeding suite 502 (the location of some of our equipment at that facility) did not transfer correctly. More to follow.

Update 9:19AM: Power has been restored to our equipment, and we’re ensuring that everything returns to service cleanly.

Update 9:32AM: Service to the vast majority of our customers has been restored. There appears to be one DSL aggregation device that is still having trouble — we’re taking a look at that now.

Update 10:22AM: All services have been restored. Please let us know if you still have any outstanding issues. We’ll be having a serious chat with our colocation provider about this event, as this outage occurred despite a massive investment in fully redundant A+B power feeds, both of which died simultaneously.

13 comments for “Large Outage in San Francisco”

  1. I’m curious: are there other ISP sites that give this level of information on outages? Sonic has come up as a possible colo facility in some of our internal discussions, but events like failure of power transfer worry me.

    I have clients with Sonic accounts and now and again check status for them when they are unable to reach Sonic by phone, as now.

    I don’t have a good metric to compare the number of disclosed issues to other providers, being relatively new at exploring the colo market. Consequently, I can’t really say whether this level of intermittent disruption is typical or worrisome.

  2. This affected my VOIP via DSL setup. I decided to just use USENET while you fixed it. Before I was done, it was up again, and my phone call worked. All is well!

    I’m hoping it was a weekend-related event.

  3. I have had sonic for over 7 years now, and I want to say GOOD JOB! This is the ONLY time during all those years we have been down. Thank you for years of excellent service and customer support. I’ve tried other providers and I wouldn’t trade you guys for ANYTHING.

  4. I have been with Sonic for a number of years now (5 or more?) and I can’t recall another outage like this. I was personally impacted because I had just swapped my router for a wireless one (which is working fine) and woke up Saturday morning thinking that was the source of the problem.

    I am sticking with Sonic.net. I LOVE doing business with a GREAT LOCAL business.

  5. When the tech support phone queue is large, it would be helpful to play a current network status message during the wait, along with a URL for more information on the progress of the outage. This would keep the queue shorter and keep customers better informed.

  6. When an outage is impacting our call load, we do put a network status message in front of our Support queue. In the case of this particular outage, we had one up by about 8:30am, and customers calling in after this time would have heard the message.

  7. Hank,

    The problem we ran into on Saturday morning was phone system capacity: we’re equipped today with 46 trunks into our phone system. Typically we’ve got 6-12 support reps on duty, plus office staff, and that’s plenty of capacity.

    But during this outage on Saturday morning we had five support reps on duty (plus the NOC team) and literally hundreds of callers. Callers got either a recording, “All circuits are busy, please try again later”, or a fast busy signal, which means the same thing.

    Support staff put a “Red Alert” message in place, which basically says that we’re aware of the issue and are working on it, and this greatly reduced the volume. But it still takes time for each caller to listen to that message, decide whether their issue is the one being described, and hang up, freeing the trunk.

    So that’s the back-story. We’ve grown a lot, so we need to add more trunking capacity and also look into how to route calls when we run out of capacity. For example, routing overflow calls to a greeting that explains we’re experiencing extreme call volume might be appropriate.

    We have over 30,000 customers – and while it’s been many years since this many were affected at once, imagine if any substantial block of them were. It would be impossible to have enough call queue capacity to hold them all, and the resulting hold times would simply be ridiculous. (A rough illustration of the trunk math appears after the comments.)

    This is a key reason we put status messages online – on our home page (which mirrors this blog), on Twitter, and via email subscription. This, plus a “Red Alert” phone message, allows us to inform as many people as possible, as quickly as possible, about the true and honest cause of the technical issue we’re dealing with.

    Finally, I’d like to say that I’m sorry about the outage! The facility where we lease space in San Francisco is a large carrier hotel, and to their credit they have responded well to this equipment failure. The room is fed by two discrete UPS systems, and all of our key networking equipment is connected to both power sources – but upstream of those UPSs, there is an automatic transfer switch which toggles the supply to the UPSs between the utility (PG&E) and the diesel generators. In this case, the ATS physically broke, leaving the UPSs stranded; when the transfer to the backup generators fails, the UPSs eventually run out of battery power. (A sketch of this power chain appears after the comments.)

    We have a pretty diverse network, with datacenters in San Francisco (two phases, with a third coming), San Jose, Los Angeles and Santa Rosa. These six facilities are all carrier grade, with significant investment in redundancy. But, there can be times when despite all of the planning, something breaks. In that case, we believe that being totally honest and forthright about the issue differentiates Sonic.net from other service provider choices that our customers might consider. I hope we can re-earn your trust in our reliability and that our integrity in disclosure is reassuring to you.

    -Dane Jasper
    CEO and Co-founder

  8. You said:

    The room is fed by two discrete UPS systems, and all of our key networking equipment is connected to both power sources – but upstream of those UPSs, there is an automatic transfer switch which toggles the supply to the UPSs between the utility (PG&E) and the diesel generators. In this case, the ATS physically broke, leaving the UPSs stranded; when the transfer to the backup generators fails, the UPSs eventually run out of battery power.

    I understand the UPSs running out of battery. I wonder if there was time for the “hotel clerk” to come do a “room call” to throw the switch manually, although from your description, both redundant ATSs physically broke (unless there’s only one?), and I’m wondering if they even had manual control, with or without the physical ATS failure(s). A lot of ifs.

    That they responded well is a good thing.

    Of course you’re great for your honesty. I’ve always said that someone who admits faults, compared to those who don’t, is usually the one with the fewest faults after all is said and done.
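
To make the failure mode above concrete, here is a minimal sketch of the power chain described in comment 7 and questioned in comment 8. It assumes a single shared ATS upstream of both UPS feeds, plus made-up battery runtimes; the facility’s real topology and timings may differ. The point it illustrates is that dual-corded equipment on fully redundant A+B UPSs still goes dark, simultaneously, when the one device upstream of both feeds breaks.

```python
# Toy model of the power chain from comment 7: utility and generator feed an
# automatic transfer switch (ATS), the ATS feeds two UPSs (A and B), and
# dual-corded equipment stays up while either UPS can deliver power.
# A single shared ATS and 30-minute batteries are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class UPS:
    battery_minutes: float  # runtime remaining on battery

    def output(self, input_power: bool, minutes_elapsed: float) -> bool:
        """Pass input power through; otherwise run on battery until exhausted."""
        if input_power:
            return True
        self.battery_minutes -= minutes_elapsed
        return self.battery_minutes > 0

def ats_output(utility_ok: bool, generator_ok: bool, ats_ok: bool) -> bool:
    """A working ATS selects whichever source is live; a broken ATS passes nothing."""
    if not ats_ok:
        return False
    return utility_ok or generator_ok

def simulate(ats_ok: bool, minutes: int = 60, step: float = 10.0) -> None:
    ups_a, ups_b = UPS(battery_minutes=30), UPS(battery_minutes=30)  # assumed runtime
    utility_ok, generator_ok = False, True  # PG&E is down, generators are running
    for t in range(0, minutes, int(step)):
        feed = ats_output(utility_ok, generator_ok, ats_ok)
        a = ups_a.output(feed, step)
        b = ups_b.output(feed, step)
        equipment_up = a or b  # dual-corded gear needs only one live feed
        print(f"t={t:3d}min  ATS ok={ats_ok}  A feed={a}  B feed={b}  equipment up={equipment_up}")

simulate(ats_ok=True)   # normal case: generators carry the load, nothing goes down
simulate(ats_ok=False)  # Saturday's case: both UPSs drain and the suite goes dark
```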
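
And for the trunk math in comment 7, a back-of-the-envelope illustration using the standard Erlang-B blocking formula. The offered loads below are assumed values, not Sonic’s actual call statistics; they simply show how 46 trunks that are ample for a normal day with 6-12 reps saturate when hundreds of customers dial in at once.

```python
# Erlang-B blocking probability for a fixed number of trunks, with assumed
# offered loads (in erlangs, i.e. simultaneous call demand) for illustration.

def erlang_b(trunks: int, offered_erlangs: float) -> float:
    """Probability a new call finds all trunks busy (Erlang-B, iterative form)."""
    b = 1.0
    for m in range(1, trunks + 1):
        b = (offered_erlangs * b) / (m + offered_erlangs * b)
    return b

TRUNKS = 46
for load in (10, 30, 46, 100, 300):
    p_block = erlang_b(TRUNKS, load)
    print(f"offered load {load:3d} erlangs -> {p_block:6.1%} of callers hit 'all circuits busy'")
```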
