Wed Oct 20 11:37:14 PDT 2004 — SpamAssassin Hiccup. Starting at approximately 2:00AM this morning, two of our SpamAssassin servers malfunctioned, causing some mail to pass through unfiltered. We resolved the issue as soon as it was brought to our attention. -Kelsey
(Updated) Power outage in Santa Rosa.
Tue Oct 19 08:17:40 PDT 2004 — (Updated) Power outage in Santa Rosa. During the storm, we suffered a brief power outage, which triggered a failure in our Liebert UPS battery plant. This resulted in downtime for most of our services, beginning at 5:57AM and lasting almost an hour. Email took a bit longer to bring online, and is available now – no email was lost, and any pending deliveries were queued.
Currently we’re running with the UPS bypassed and with our diesel generator running on standby in case of another utility outage. We will be working with Liebert service to determine why the UPS failed – our experience with the Liebert has been very, very disappointing, and while I normally hesitate to name a vendor here in the MOTD, I’m making an exception in this case.
Update: We’re back on UPS, after finding and eliminating a bad battery cell. We also found three other batteries which were dropping below the ideal voltage when a load was applied. All of these batteries were recently tested, so we’re investigating why we’ve got such a large number of battery failures.
Following this outage, we’re working up a list of lessons learned and to-do items to reduce the impact of this type of failure in the future. Here are a few of the items we’re beginning work on now:
- Deploy additional battery cabinet in parallel (cost: around $17,000).
- Coordinate with Liebert and test to assure that system is RELIABLE!
- Schedule and execute full live system tests during 3AM maint window.
- Increase descriptiveness of text paging, and the number of people paged.
- Add UPS contact monitoring via Honeywell system and manned call center.
- Deploy RADIUS and DNS servers in San Francisco datacenter for redundancy.
- Test network routing absent Santa Rosa datacenter to assure reachability.
- Deploy an additional UPS here for systems and routers with dual inputs.
- Add UPS to voicemail system, and increase runtime on phone system UPS.
- Drill process for customer notification and internal response.
We are committed to learning from this failure, and to making changes to assure that this type of outage doesn’t occur again. I’m very sorry about the downtime. -Dane
Busies on 1001.
Mon Oct 11 19:44:35 PDT 2004 — Busies on 1001. The gear servicing our access numbers ending with 1001 is intermittently returning busy signals. No other dialup groups should be affected. Temporarily switching to another local access number is advisable; use the popfinder at www.sonic.net/popf/ for alternate numbers. – John F
Internal routing issue.
Sun Oct 10 03:31:21 PDT 2004 — Internal routing issue. At approximately 2:40AM, a pair of routers at Sonic.net’s facility in Santa Rosa corrupted the OSPF database for many of their neighbors. We worked to stabilize our network until approximately 3:00AM, when full service was restored to all customers. We are on the phone with the equipment vendor at the moment, and are working to identify the bug which caused this failure. This outage affected Sonic.net colocation and DSL customers served out of Santa Rosa. – Nathan, Zeke, and Kevan
Webmail issue solved.
Sat Oct 9 20:56:25 PDT 2004 — Webmail issue solved. A problem with Sonic.net webmail authentication was solved today. One server appeared to be misbehaving and was rebooted. Additionally, a file share was found to be full, and its size was increased. -Scott and John
Core network upgrade completed.
Fri Oct 8 02:52:40 PDT 2004 — Core network upgrade completed. The migration to the new core switches was completed without a hitch. Over the next few weeks we’ll be continuing the migration to our new gigabit core and gigabit backbone. -Nathan, Kelsey, Zeke and Jared.
Tonight we begin our massive core network…
Thu Oct 7 14:29:49 PDT 2004 — Tonight we begin our massive core network upgrade in Santa Rosa. There will be a number of several-second outages as we migrate between our old and new core switches. Sorry for the inconvenience, but This Is Good. -Nathan, Kelsey, Zeke, Jared and Jared.
Administrative SQL off-line for repair.
Wed Oct 6 17:35:59 PDT 2004 — Administrative SQL off-line for repair. Our administrative SQL server has been taken down for some emergency repair work. This affects a number of things, most notably access to the member tools, the ability to sign up new accounts, and tech support’s ability to review or create support tickets. -Kelsey UPDATE – Repair work has been completed successfully. -Kelsey
Accelerated dialup service is now available…
Mon Sep 27 12:19:04 PDT 2004 — Accelerated dialup service is now available to Sonic.net members at no additional charge.
Sonic.net is now offering the Sonic.net Accelerator, an upgrade to your current dialup connection that offers web browsing at up to five times the speed. Best of all, there is no additional charge for this exciting new feature! For dialup customers who have felt a bit left behind with all the focus on DSL and wireless broadband, this is very good news!
The Sonic.net Accelerator improves your web browsing experience by dramatically accelerating the delivery of web pages to your PC. Our unique combination of intelligent browser caching, compression technology, and edge caching enables Web pages to be delivered to your browser with minimal latency at maximum speeds over your existing dial-up connection. As a result, you can expect a very noticeable improvement in web browsing response time. The Sonic.net Accelerator also includes options to automatically block banner ads and pop-up ads, which can contribute to slow web browsing.
For more information and to download the accelerator, visit the web page at:
www.sonic.net/support/accelerator/
-Dane
Mail problems.
Thu Sep 23 07:55:27 PDT 2004 — Mail problems. For about an hour this morning, starting at 6:45, customers were unable to send email via mail.sonic.net. We found that a customer had opened too many connections to the servers, causing resource exhaustion. We blocked them and restored service. We will be analyzing the event in order to determine how we can avoid this problem in the future. -Nathan and John
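For readers curious how this kind of resource exhaustion is commonly guarded against, here is a minimal sketch of a per-client cap on concurrent connections. It is an illustration only, not a description of what runs on mail.sonic.net: the `ConnectionLimiter` class, the cap of 3, and the example addresses are all hypothetical.

```python
from collections import defaultdict


class ConnectionLimiter:
    """Track concurrent connections per client and refuse any over a cap.

    Hypothetical helper for illustration; the cap value is an assumption.
    """

    def __init__(self, max_per_client):
        self.max_per_client = max_per_client
        self.active = defaultdict(int)  # client address -> open connections

    def try_open(self, client):
        """Return True and count the connection if the client is under the cap."""
        if self.active[client] >= self.max_per_client:
            return False
        self.active[client] += 1
        return True

    def close(self, client):
        """Release one connection slot held by the client."""
        if self.active[client] > 0:
            self.active[client] -= 1
        if self.active[client] == 0:
            del self.active[client]  # keep the table from growing without bound


limiter = ConnectionLimiter(max_per_client=3)
# One client tries to open five connections; only the first three are accepted.
results = [limiter.try_open("192.0.2.10") for _ in range(5)]
# A different client is unaffected by the first client's usage.
other = limiter.try_open("192.0.2.20")
print(results, other)
```

With a cap like this, a single misbehaving client is turned away once it reaches its limit, while other customers can still connect.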