Month: October 2004
UFO/UUNET route flap caused temporary instability.
Sun Oct 24 10:12:23 PDT 2004 — UFO/UUNET route flap caused temporary instability. This morning, from about 8:15 AM to 8:30 AM, our route to UFO/UUNET flapped several times, causing connectivity problems to many sites. -Kavan
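Upstream providers commonly mitigate repeated flaps like the one described above with BGP route-flap dampening, which temporarily suppresses a route that withdraws and reappears too often. A hypothetical Cisco-style sketch (the AS number and thresholds are illustrative only, not Sonic.net's actual configuration):

```
router bgp 64512
 ! half-life 15 min, reuse below penalty 750, suppress above 2000,
 ! maximum suppress time 60 min
 bgp dampening 15 750 2000 60
```

Dampening trades faster convergence for stability: a persistently flapping prefix stays suppressed until its penalty decays, so downstream customers see one steady outage instead of repeated churn.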
DSL Maintenance.
Thu Oct 21 14:32:55 PDT 2004 — DSL Maintenance. On Saturday morning at 1 am we will be performing maintenance on DSL equipment located in San Francisco. DSL customers served out of San Francisco will experience a brief interruption in service while the DSL router reboots. -Operations
SpamAssassin Hiccup.
Wed Oct 20 11:37:14 PDT 2004 — SpamAssassin Hiccup. Starting at approximately 2:00AM this morning, two of our SpamAssassin servers malfunctioned, causing a portion of mail to pass through unfiltered. We resolved the issue as soon as it was brought to our attention. -Kelsey
(Updated) Power outage in Santa Rosa.
Tue Oct 19 08:17:40 PDT 2004 — (Updated) Power outage in Santa Rosa. During the storm, we suffered a brief power outage, which triggered a failure in our Liebert UPS battery plant. This resulted in downtime for most of our services, beginning at 5:57AM and lasting almost an hour. Email took a bit longer to bring online, and is available now – no email was lost, and any pending deliveries were queued.
Currently we’re running with the UPS bypassed and with our diesel generator on standby in case of another utility outage. We will be working with Liebert service to determine why the UPS failed – our experience with the Liebert has been very, very disappointing, and while I normally hesitate to name a vendor here in the MOTD, I’m making an exception in this case.
Update: We’re back on UPS, after finding and eliminating a bad battery cell. We also found three other batteries which were dropping below the ideal voltage when a load was applied. All of these batteries were recently tested, so we’re investigating why we’ve got such a large number of battery failures.
Subsequent to this outage, we’re working up a list of lessons learned, and items for our to-do list, to avoid the impact of this type of failure in the future. Here are a few of the items we’re beginning work on now:
- Deploy an additional battery cabinet in parallel (cost: around $17,000).
- Coordinate with Liebert and test to assure that the system is RELIABLE!
- Schedule and execute full live system tests during the 3AM maintenance window.
- Increase the descriptiveness of text paging, and the number of people paged.
- Add UPS contact monitoring via the Honeywell system and manned call center.
- Deploy RADIUS and DNS servers in the San Francisco datacenter for redundancy.
- Test network routing absent the Santa Rosa datacenter to assure reachability.
- Deploy an additional UPS here for systems and routers with dual inputs.
- Add a UPS to the voicemail system, and increase runtime on the phone system UPS.
- Drill the process for customer notification and internal response.
We are committed to learning from this failure, and to making changes to assure that this type of outage doesn’t occur again. I’m very sorry about the downtime. -Dane
Busies on 1001.
Mon Oct 11 19:44:35 PDT 2004 — Busies on 1001. The gear servicing our access numbers ending with 1001 is intermittently returning busy signals. No other dialup groups should be affected. Temporarily switching to another local access number is advisable; use the popfinder at www.sonic.net/popf/ for alternate numbers. – John F
Internal routing issue.
Sun Oct 10 03:31:21 PDT 2004 — Internal routing issue. At approximately 2:40AM, a pair of routers at Sonic.net’s facility in Santa Rosa corrupted the OSPF database for many of their neighbors. We worked to stabilize our network until approximately 3:00AM, at which time full service was restored to all customers. We are on the phone with the equipment vendor at the moment, and are working to determine the bug that caused this failure. This outage affected Sonic.net colocation and DSL customers served out of Santa Rosa. – Nathan, Zeke, and Kevan
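For readers curious how this kind of event is typically diagnosed: on most routers the OSPF adjacencies and link-state database can be inspected directly from the CLI. A generic Cisco-style sketch (the post does not name the equipment vendor, so the exact commands are an assumption):

```
show ip ospf neighbor    ! verify adjacency states (FULL is healthy)
show ip ospf database    ! inspect the LSDB for unexpected or max-aged LSAs
clear ip ospf process    ! last resort: flush and rebuild the OSPF database
```

The first two commands are read-only and safe to run at any time; the last one tears down and re-forms all adjacencies, so it is itself briefly disruptive.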
Webmail issue solved.
Sat Oct 9 20:56:25 PDT 2004 — Webmail issue solved. A problem with Sonic.net webmail authentication was resolved today. One server was misbehaving and was rebooted, and a file share that had filled up was expanded. -Scott and John
Core network upgrade completed.
Fri Oct 8 02:52:40 PDT 2004 — Core network upgrade completed. The migration to the new core switches was completed without a hitch. Over the next few weeks we’ll be continuing the migration to our new gigabit core and gigabit backbone. -Nathan, Kelsey, Zeke and Jared.
Tonight we begin our massive core network upgrade.
Thu Oct 7 14:29:49 PDT 2004 — Tonight we begin our massive core network upgrade in Santa Rosa. There will be a number of several-second outages as we migrate between our old and new core switches. Sorry for the inconvenience, but This Is Good. -Nathan, Kelsey, Zeke, Jared and Jared.
Administrative SQL off-line for repair.
Wed Oct 6 17:35:59 PDT 2004 — Administrative SQL off-line for repair. Our administrative SQL server has been taken down for some emergency repair work. This affects a number of things, most notably access to the member tools, the ability to sign up new accounts, and tech support’s ability to review or create support tickets. -Kelsey
UPDATE – Repair work has been completed successfully. -Kelsey