Month: September 2008

Los Angeles DHCP Failure

This morning one of the DHCP servers in our Los Angeles PoP suffered a disk failure that prevented it from storing the leases customers obtained. This should not have affected customers' connectivity to the Internet. To resolve the issue, we have swapped over to spare hardware that we had on site. A small minority of customers may have experienced a brief hiccup during the transition; most customers should not have noticed it at all. -William and Jared

Sonic’s New Member Tools are now Member Tools!

Today we finished the process of switching our Member Tools system from its old home on http://sonic.sonic.net to https://members.sonic.net/. A few tools still remain on the old server, but https://members.sonic.net/ is now the URL from which you will access all tools.

Moving to our new server will allow us to add new tools more easily and scale out our systems as needed. We’ve also tried to make our tools more reliable and easier to use as part of the move.

If you’ve never tried our tools before, please do check out https://members.sonic.net/. You may be surprised to find how easily they enable you to customize your account to your specific needs.

–Dianne

Mail Cluster Maintenance

Tonight, shortly after midnight, we will replace the failing disk shelf in one of our Network Appliance filers. This shelf was responsible for the POP, IMAP and Webmail service interruptions earlier today. Replacing the shelf is not expected to take more than 30 minutes. During the maintenance, users may not be able to check their email. However, new mail will be queued for delivery on our MX cluster and all outbound email will continue to flow unaffected. -Kelsey

Update: The faulty shelf has been replaced and all services have been fully restored. Total downtime for POP, IMAP and Webmail was less than 20 minutes. -Kelsey

Multiple Service Interruption

Early this afternoon, we experienced a failure in one of our NetApp filers, which caused some content to be unavailable and a few key systems to present timeout errors. Downtime is estimated at between 5 and 7 minutes. We have identified the problem and restored all services. –William, Augie, and Don

LATA1 AT&T ATM Outage

At 5:03 PM today, all of our DSL and Business-T subscribers in LATA1 went offline. The problem appeared to be internal to AT&T’s ATM network. As of 5:12 PM, the problem in AT&T’s network appears to have been resolved, and DSL and Business-T customers are back online. We are continuing to monitor the situation and work with AT&T to find out what happened in their network. More details will be forthcoming as we get them.

-Jared, and everyone in the NOC and Support

Update: Word from AT&T and ASI (AT&T’s ATM division) is that they suffered a major fiber issue in one of their hub Central Offices in San Francisco. Traffic has been routed around the problem, and they are currently working on resolving the fiber issue. There is a small chance that traffic may be interrupted while AT&T works on the issue, but they will be doing everything they can to prevent that. We will be monitoring our ATM links very closely for the next 24 hours to mitigate any potential problems.

Update: AT&T has determined that the outage was caused by a failing ATM fabric card. Service was restored when traffic was moved to the spare fabric card. The failed card has been replaced, with no further service interruption.

ATM Switch Maintenance

Tonight at midnight we will be performing potentially disruptive maintenance on one of our ATM switches. We will be reloading the primary route processor in the ATM switch, and it should automatically fail over to the backup. Impact should be only a few seconds as the failover occurs, though it may be up to 10 minutes if problems are encountered. The ATM switch in question serves DSL and Business-T customers in the Bay Area, Sacramento and Fresno.

-Jared and Nathan

Update: The failover to the backup route processor did not go smoothly, so we were forced to reboot the ATM switch chassis. This resulted in approximately 10 minutes of connectivity interruption for customers connected through this ATM switch. We apologize for the interruption. We are monitoring traffic levels across the ATM switch and they appear to be moving back to normal levels. We will continue to monitor the switch closely.