
Mail Service Interruption

One of the four NFS filers that support our backend mail spool storage suffered a broken FCAL loop and went off-line at 23:24. While the other three remained in service, the mail cluster doesn’t take kindly to this situation, and users may have noticed delays, timeouts or other errors while checking their mail. All services were completely restored and operational by 23:56. The loop was most likely broken by a disk that had failed earlier this evening and was awaiting removal from the system in the morning. This is one of the rare failure scenarios that our clustered filers are not able to handle without manual intervention. -Kelsey and Don

Broadlink WDSL Outage

Broadlink Wireless DSL customers served off one of their main towers in the Santa Rosa area are currently offline, due to a power failure in the area. Broadlink is en route to the tower site with generators, and we expect service to be restored shortly.

Datacenter was on diesel

Our Santa Rosa datacenter facility is currently running on our backup generator due to problems with the automatic power transfer system. While we do not anticipate this will cause any impact for customers, it’s certainly a failure of a sort. Normally, the transfer switch starts the generator during a utility failure, but today it has triggered without cause. We are working to get back onto primary utility now.

Update: We’re back on utility, but managed to cause a fault in our air handlers in the process of going back and forth multiple times. This bumped the temperature in the datacenter from the typical 69 degrees up to over 90. All systems are back online, and temperatures are now close to normal. There was no customer impact during this partial failure. -Dane, Russ, Kelsey, Don, Nathan and Jen

Update: ATS Fault Analysis and Repair. After reviewing the situation with a GE/Zenith support tech last night, it was concluded that the modular timer responsible for automatic genset exercising and transfer tests was triggering the erroneous transfers to emergency power. This timer was ‘off’ when it failed and, indeed, had never been used, as we prefer to manually initiate our weekly genset exercise tests. The faulty timer has been removed and we have every confidence that our ATS will work as expected from now on. -Kelsey, Russ and Nathan

Sebastopol DSL Outage

A hardware failure at the Sebastopol Central Office has caused DSL customers to lose connectivity. We are working with AT&T to resolve the issue, and hope to have an estimated time of repair shortly.

-Adam, John and Steve

Update: As of 5:00PM, service appears to be restored to all affected customers.

DSL Aggregation Router Reboot

The DSL aggregation router that serves DSL to the Chico area rebooted itself approximately 20 minutes ago, causing about 5 minutes of downtime for all DSL customers in that area. Traffic levels and customer connectivity currently look normal. We will continue to monitor the router and investigate the cause of the spontaneous reboot.

We apologize for any inconvenience this outage may have caused.

-Jared

Webmail IMAP performance problems solved.

Separately from our earlier post about slow imap.sonic.net performance (http://corp.sonic.net/status/2008/09/26/imap-performance-problems-solved/), we have also received reports of slow Webmail IMAP performance and timeouts from Customers using the Webmail clients on http://webmail.sonic.net.

We believe we have isolated the problem, which was a bug in our IMAP Proxy software. We implemented a fix at the beginning of the week and have not received any reports of new problems since.

If you see timeouts when using http://webmail.sonic.net, please contact Technical Support (support@sonic.net or 1.707.547.3400) immediately, and provide the error message you receive and the time at which the problem occurred.

Webmail Web Site Time-Warp.

A misguided attempt to update some software on our Webmail Cluster inadvertently took the software, associated web pages, and server configuration back to January of this year.

As a result, Customers would have seen inconsistent or broken behavior while trying to access the website from around 2:30am to 8:00am, at which point the data was restored from backups.

We apologize for any inconvenience this caused to our Customers; we will be reviewing our documented procedures so that this type of mistake does not occur in the future.

–Augie

Emergency Router Maintenance.

At 3:40PM this afternoon we will be performing an emergency router reload on one of our ATM customer aggregation routers. All connected Business-T and FRATM customers will experience approximately 5 minutes of downtime during the reload. -Tim, Nathan, Matt and Jared

IMAP performance problems solved.

We believe we have fixed the recent IMAP performance problems; let us know if you still have any trouble with imap.sonic.net.

For the past few days we have received a number of reports from customers concerned about periodic slow IMAP performance to imap.sonic.net; the problem would manifest as a mail client behaving very slowly when reading e-mail or timing out altogether.

The periodic nature of the problem made it very difficult to troubleshoot: by the time a customer called to alert us, the problem would mysteriously have gone away.

We believe we have solved the problem and have since implemented much more detailed monitoring of the imap.sonic.net IMAP cluster.

Note: we have seen a handful of reports of slow Webmail performance; those servers are completely separate from the imap.sonic.net cluster, so if you are seeing Webmail performance problems, please note the time it occurred and the error you saw, if any, and send an e-mail to support@sonic.net.

Mail Service Interruption.

Early this morning, around midnight, one of our clustered NetApp filers suffered a critical failure which caused one of its Ethernet interfaces to lock up. Some customers may have noticed timeouts or other errors while trying to check their mail. The total downtime for the service was around 15 minutes. We will be investigating the problem further with our vendor in the waking hours. –Augie and Don.