Category: Network

CLEC Intrusive Maintenance – SNFCCA21

At 12:01 AM on 9/2/10 we will be performing intrusive maintenance on the DSLAM serving the (San Francisco) SNFCCA21 POP. All Flexlink Ethernet and Fusion customers served from this POP should expect between 5 and 15 minutes of downtime while the work is performed.

-Matt, Tim and Jared

Update: Maintenance work has been completed. Downtime for affected customers was less than 5 minutes.

Network Instability

At 12:55pm today it was brought to our attention that customers on our network were experiencing intermittent reachability to parts of the internet. We determined the source to be a routing issue with one of our upstream providers. At approximately 1:08pm we routed traffic away from the affected upstream, mitigating the issue.
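Shifting traffic away from a troubled upstream like this is typically done by adjusting BGP policy. As a rough sketch only (Cisco IOS-style syntax; the neighbor address and AS numbers here are illustrative placeholders, not our actual configuration):

```
! Lower local-preference on routes learned from the affected upstream,
! so outbound traffic prefers our other providers.
route-map DEPREF-UPSTREAM permit 10
 set local-preference 50
!
! Discourage inbound traffic by prepending our AS on announcements
! to that provider.
route-map PREPEND-OUT permit 10
 set as-path prepend 64500 64500 64500
!
router bgp 64500
 neighbor 192.0.2.1 route-map DEPREF-UPSTREAM in
 neighbor 192.0.2.1 route-map PREPEND-OUT out
```

Changes like this let traffic drain away from the bad path without tearing down the BGP session, which makes it easy to restore the link once the provider's issue is verified as fixed.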

The upstream provider has since fixed their routing issue, but we are performing further tests on the link before returning network traffic back to it.

-Matt and Jared

Edit: Fixed incorrectly stated times in the above message. -Matt

Non-Intrusive Santa Rosa T1 Aggregation Router Maintenance

This evening at 12:01 AM we will be performing a card swap on one of the aggregation routers serving legacy T1s out of our Santa Rosa office. This should not be customer impacting, but as with most maintenance there is a small chance of downtime.

-Matt, Tim and Jared

Update: This router maintenance has completed without incident.

San Francisco ATM Switch Failure

At approximately 10:40AM we had a hardware failure on an ATM switch in San Francisco. We are presently rebooting it. Expected downtime is 5-7 minutes. -Sonic NOC

Update 11:10AM: The ATM switch reload is complete and traffic appears to be returning to normal. If you continue to have DSL sync-no-surf connectivity issues, please contact our tech support.

Large Outage in San Francisco

We’re currently experiencing a rather large outage in San Francisco, presumably caused by a power failure to some of our colocation space there. All available resources are being brought to bear, and we’ll have more shortly. This outage is likely impacting a large portion of our DSL customers as well as some Dial, Biz-T, FRATM, and other services. -Nathan, Jared, and the rest of the NOC

Update 8:25AM: We have confirmed that the issue is caused by a power failure. Apparently the facility at 200 Paul Avenue is currently without utility power. The vast majority of the site’s generators, automatic transfer switches (ATSes), and UPS systems worked properly, but the UPSes feeding suite 502 (the location of some of our equipment at that facility) did not transfer correctly. More to follow.

Update 9:19AM: Power has been restored to our equipment, and we’re ensuring that everything returns to service cleanly.

Update 9:32AM: Service to the vast majority of our customers has been restored. There appears to be one DSL aggregation device that is still having trouble — we’re taking a look at that now.

Update 10:22AM: All services have been restored. Please let us know if you still have any outstanding issues. We’ll be having a serious chat with our colocation provider about this event, as this outage occurred despite a massive investment in fully redundant A+B power — both of which died simultaneously.

Unexpected DSL Aggregation Router Reboot

At approximately 3:00 AM today, one of our DSL aggregation routers serving customers in the Bay Area suffered what appears to be a software crash. This crash kept the router out of service until 3:14 AM, at which time it was rebooted and resumed handling customer traffic normally.

The exact cause of the software crash is being investigated with the vendor. We are monitoring the router, and all affected customers appear to be back online at this time.

-Jared

Update: The router has unexpectedly crashed again. We are in the process of moving all customers served by this router to a hot spare. Connectivity should be restored shortly.

Update: All affected customers have been moved to the hot spare successfully and should be online again at this time.

ATM Customer Aggregation Router Reload

This evening, Tuesday, Dec 29 at 12:01 AM, we will be performing a maintenance reload on our ATM customer aggregation routers. This will result in 5-10 minutes of downtime for Business-T and FRATM customers.

-Jared

Update: The scheduled router reloads have been completed and all affected customers should be back online. Downtime for affected customers was < 10 minutes.

DSL Aggregation Equipment Failure

One of our DSL aggregation routers handling customers in California’s LATA 1 (largely the San Francisco area) started having problems at approximately 10:41pm tonight. Customers experienced a complete service outage until 10:47pm, at which point we were able to restore partial service. We’re presently working to move the affected customers’ service to alternate equipment, which will involve 10-15 minutes’ worth of additional downtime in the near future. At present, customers on this particular piece of aggregation gear are experiencing ~7% packet loss. We’ll update this entry as our work progresses.

-Nathan, Matt and Jared

Update 11:14pm:

The problem turned out to be a failing ATM port inside our network. We’ve migrated all traffic to a hot-spare port, and service has returned to normal.

Pogowave finishes network address translation migration

Today is the day that the wireless department will turn off network address translation on our last Pogowave station: the Golden Apple Ranch on Fitzpatrick Lane.

This is our largest and most diverse station, featuring three low band access points.

We expect this to require less than 30 minutes of downtime. Thank you for your patience during this transition, which will allow for improved customer support for all Sonic.net Pogowave customers.

DHCP Server Failure

This morning at approximately 8 AM our Support staff became aware that one of our DHCP servers was not serving DHCP leases properly, in a way that our monitoring systems could not detect. Our Support team escalated the problem to the Operations staff, and we were able to remedy it by rebooting the partially locked server. Service was restored by approximately 8:28 AM. We apologize for this outage and will be looking into ways of extending our monitoring so that this unusual type of failure is not missed again.
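The kind of monitoring gap described above, where the server process looks alive but silently stops answering, is usually closed with a functional check that exercises the actual lease exchange. A minimal sketch in Python (the `send_discover` hook is hypothetical, standing in for a real DHCPDISCOVER/DHCPOFFER exchange over the network):

```python
def dhcp_lease_check(send_discover, timeout=5.0):
    """Functional DHCP health check (sketch).

    A process- or port-level check can pass while a partially locked
    server stops handing out leases, so only a completed
    DISCOVER -> OFFER round trip counts as healthy.

    `send_discover` is a hypothetical hook: it should broadcast a
    DHCPDISCOVER, wait up to `timeout` seconds for a DHCPOFFER, and
    return the offered address (or raise TimeoutError).
    """
    try:
        offered = send_discover(timeout=timeout)
    except TimeoutError:
        return False, "no DHCPOFFER within %.1fs" % timeout
    if not offered:
        return False, "malformed or empty DHCPOFFER"
    return True, offered


# Simulated healthy server: answers promptly with an address.
ok, detail = dhcp_lease_check(lambda timeout: "192.0.2.50")

# Simulated partially locked server: process alive, but never answers.
def hung_server(timeout):
    raise TimeoutError

bad, reason = dhcp_lease_check(hung_server)
```

The point of the design is that the check fails on a timeout, not just on a refused connection, so a hung-but-running server is flagged the same way as a dead one.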

-Jared and the Support team

Update: The DHCP server has begun to fail again, so we have removed it from service. DHCP should be functioning normally again. If you are having problems obtaining a DHCP lease, please try rebooting your modem and any computers or routers that are having problems.