Month: February 2010

San Francisco datacenter running on generator

PG&E utility power is currently offline at the 200 Paul Avenue facility in San Francisco, and the site is running normally on generator power.  Automatic transfer switching worked as designed, and all is well.

After the failure last week of a transfer switch in this San Francisco datacenter, it’s good to see that repairs to that redundant power system worked and that the repaired transfer switch did its job.

-Dane & Nathan

2N CRAC Redundancy Pays Dividends

At 7:11 PM tonight, one of our two Core4 CRAC (Computer Room Air Conditioning) units unexpectedly shut itself down.  Nothing instills fear more than receiving pages titled “Sys A Enable Switch Turned Off / Service Now” and “High Discharge Air Temperature” in rapid succession.  After the initial panic passed, it was clear that all redundant systems were operating correctly and that the second unit had responded as designed, ramping up to handle the total cooling load.  Once on site, there was no outward indication of why System A had shut down: the system-enable switch was correctly in the “On” position, and the unit had supply power.  Upon further investigation, however, it was apparent that the enable switch was water-logged, oxidized, and shorted out, signaling the system to shut down.  The switch has been removed from service and both systems are 100% operational again.
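The failover behavior above can be sketched in a few lines.  This is a hypothetical model, not Sonic’s actual controls or monitoring code: with 2N redundancy, either unit alone is sized for the full cooling load, so when one unit drops out the survivor simply ramps to 100% while an alert pages the on-call engineer.  All names and page titles here are illustrative.

```python
# Illustrative sketch only -- a minimal model of 2N CRAC failover, where
# either unit alone can carry the full cooling load.
from dataclasses import dataclass

@dataclass
class CracUnit:
    name: str
    enabled: bool = True      # state of the system-enable switch
    output_pct: float = 50.0  # share of total cooling load this unit carries

def rebalance(units):
    """Spread the full cooling load across whatever units remain enabled.

    With 2N redundancy, a single surviving unit ramps to 100%.
    Returns a list of alert strings to page out."""
    alerts = []
    active = [u for u in units if u.enabled]
    for u in units:
        if not u.enabled:
            alerts.append(f"{u.name} Enable Switch Turned Off / Service Now")
            u.output_pct = 0.0
    if not active:
        # No cooling left at all -- temperature alarm is imminent.
        alerts.append("High Discharge Air Temperature")
        return alerts
    share = 100.0 / len(active)
    for u in active:
        u.output_pct = share
    return alerts

# Example: System A's enable switch shorts out and signals a shutdown.
a, b = CracUnit("Sys A"), CracUnit("Sys B")
a.enabled = False
alerts = rebalance([a, b])
# Sys B now carries the entire load, and the shutdown pages the on-call engineer.
```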

Although it is disappointing to see a system failure caused by something as simple as an improperly weatherproofed mechanical room and control panel, it is rewarding to see our commitment to and investment in redundancy pay off.  And, ultimately, that the prototype Core4 CRAC system behaved as expected.

Special thanks go out to Jimmy and Kent of Bell Products, who interrupted their dinners to come out and verify that all systems were functioning correctly.

-Kelsey, Nathan and Russ

Non-Impacting Transport Issue

One of our backbone network transport links began having issues this morning. We have removed that link from service, and are routing traffic internally around the problem as we work with the transport circuit provider to diagnose the intermittent problems. These problems did not cause any customer impact, but as we route around the problem, customers may notice sub-optimal paths inside Sonic’s network (i.e. from San Francisco to San Jose via Santa Rosa).

We are keeping a close eye on the situation and will restore normal routing once we are certain that the transport circuit is fully resolved.
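Routing around a drained link falls out of ordinary shortest-path routing: raise the metric on the troubled circuit, and the interior routing protocol prefers the next-best internal path, even if it is geographically longer.  Here is a toy sketch with an invented topology and invented costs (not Sonic’s actual backbone or metrics), using plain Dijkstra in place of a real IGP:

```python
# Illustrative sketch: why draining a link produces "sub-optimal" paths.
# Topology and costs are hypothetical; real IGP metrics differ.
import heapq

def shortest_path(graph, src, dst):
    """Plain Dijkstra returning (cost, path) from src to dst."""
    heap = [(0, src, [src])]
    seen = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node in seen:
            continue
        seen.add(node)
        if node == dst:
            return cost, path
        for nbr, w in graph.get(node, {}).items():
            if nbr not in seen:
                heapq.heappush(heap, (cost + w, nbr, path + [nbr]))
    return float("inf"), []

# Hypothetical backbone: a direct SF<->San Jose transport link,
# plus a longer path north through Santa Rosa.
graph = {
    "SF":        {"SanJose": 10, "SantaRosa": 5},
    "SanJose":   {"SF": 10, "SantaRosa": 20},
    "SantaRosa": {"SF": 5, "SanJose": 20},
}

# Normally, traffic takes the direct link.
print(shortest_path(graph, "SF", "SanJose"))  # (10, ['SF', 'SanJose'])

# Drain the troubled transport link by raising its metric sky-high;
# traffic now routes SF -> Santa Rosa -> San Jose instead.
graph["SF"]["SanJose"] = graph["SanJose"]["SF"] = 10_000
print(shortest_path(graph, "SF", "SanJose"))  # (25, ['SF', 'SantaRosa', 'SanJose'])
```

The drained link stays up the whole time, which lets the carrier run intrusive tests on the circuit without further customer impact.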

-Jared and Nathan

Secure Server Service maintenance

This Wednesday morning (Feb. 24) at 12:01 AM, Secure Server Services will be unavailable for approximately 30 minutes. This affects all customers with services on ssl.sonic.net or secure web hosting services with us.

After the maintenance these customers will see a substantial performance boost.

–Augie

Large Outage in San Francisco

We’re currently experiencing a rather large outage in San Francisco, presumably caused by a power failure in some of our colocation space there. All available resources are being brought to bear, and we’ll have more shortly. This outage is likely impacting a large portion of our DSL customers, as well as some Dial, Biz-T, FRATM, and other services.

-Nathan, Jared, and the rest of the NOC

Update 8:25AM: We have confirmed that the issue is caused by a power failure. Apparently the facility at 200 Paul Avenue is currently without utility power. The vast majority of the site’s generators, ATSes, and UPS systems worked properly, but the UPSes feeding suite 502 (the location of some of our equipment at that facility) did not transfer correctly. More to follow.

Update 9:19AM: Power has been restored to our equipment, and we’re ensuring that everything returns to service cleanly.

Update 9:32AM: Service to the vast majority of our customers has been restored. There appears to be one DSL aggregation device that is still having trouble — we’re taking a look at that now.

Update 10:22AM: All services have been restored. Please let us know if you still have any outstanding issues. We’ll be having a serious chat with our colocation provider about this event, as this outage occurred despite a massive investment in fully redundant A+B power feeds — both of which died simultaneously.

Changes for former Humboldt Internet customers

Earlier today we migrated the remaining humboldt1.com Internet services over to Sonic.net.

This provides those customers with the improved service and reliability that all Sonic.net customers enjoy.

If you are a former Humboldt Internet customer and you are having trouble checking your e-mail or your website, please contact our Support team:

http://www.sonic.net/contactus.shtml
1 (707) 547-3400
support@sonic.net

–Augie

Unexpected DSL Aggregation Router Reboot

At approximately 3:00 AM today, one of our DSL aggregation routers serving customers in the Bay Area suffered what appears to be a software crash. The crash kept the router out of service until 3:14 AM, at which time it had booted back up and was handling customer traffic normally.

The exact cause of the software crash is being investigated with the vendor. We are monitoring the router, and all affected customers appear to be back online at this time.

-Jared

Update: The router has unexpectedly crashed again. We are in the process of moving all customers served by this router to a hot spare. Connectivity should be restored shortly.

Update: All affected customers have been moved to the hot spare successfully and should be online again at this time.

DSL Outage in Stockton area

At this time we are tracking an ATM backhaul problem affecting LATA9. This LATA covers DSL service for the Stockton/Modesto areas. We believe the problem to be on AT&T’s backhaul side, and are working with them currently to diagnose and repair.

-Jared

Update: We have narrowed the issue down to a specific line card in an ATM device in our network. Due to the nature of the problem, we had to perform a reload of the entire device. This briefly affected customers in LATAs 1, 2, and 6 as well. We apologize for affecting previously unaffected customers while reloading this device.

CLEC DSLAM Maintenance

This evening, at 12:01 AM, we will be performing a software upgrade on the DSLAM that serves Fusion and FlexLink customers in the downtown Santa Rosa area. Expected downtime is less than 30 minutes while the DSLAM is rebooted onto the new software release.

-Tim and Nathan

DSL Aggregation Router Reload

This evening, Saturday, February 6 at 12:01 AM we will be performing maintenance reloads of two Redbacks that terminate traditional DSL service. This will affect all Chico DSL subscribers, and some of our Bay Area DSL subscribers. Expected downtime is 5 minutes.

-Jared

Update: The reloads have been completed without incident. All affected customers should be back online at this time. Total downtime was less than 5 minutes.