Tue Oct 26 11:02:56 PDT 2004 — Broadlink TMAC tower update. Broadlink reports an ETR of 1 hour; they are on site and working with Cisco engineers to repair the outage. -Scott, Linda, and Jason (from Broadlink)
Value Hosting migration to new web cluster.
Tue Oct 26 10:43:17 PDT 2004 — Value Hosting migration to new web cluster. Tomorrow, October 27th, at 11am, we will swing Value Hosting sites to the new web cluster. This should be a seamless migration.
Value Hosting customers using FrontPage will find their site migrated to the latest FP server extensions, which means their sites will now be compatible with FrontPage 2003.
After the migration, we will send an email to Value Hosting customers inviting them to test their web sites. We encourage these customers to test their sites at that time.
If you have questions or concerns regarding this migration, please visit the news://news.sonic.net/sonic.help.www newsgroup, or email support.
-Scott
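For customers who want a quick, automated spot-check of their site after the cutover, here is a rough sketch (the hostname and page list are hypothetical placeholders, not anything specific to our cluster) that fetches a few pages and reports what comes back:

    # Post-migration smoke test: fetch a few pages and report the result.
    # SITE and PAGES are hypothetical placeholders -- substitute your own
    # Value Hosting site and the pages you care about.
    import urllib.request

    SITE = "http://www.example.com"
    PAGES = ["/", "/index.html", "/contact.html"]

    for page in PAGES:
        url = SITE + page
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read()
                print(f"{url}: HTTP {resp.status}, {len(body)} bytes")
        except Exception as exc:
            print(f"{url}: FAILED ({exc})")

Anything that returns an error, or a page that looks different from what you expect, is worth a follow-up to support.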
Broadlink TMAC backhaul down.
Tue Oct 26 10:09:37 PDT 2004 — Broadlink TMAC backhaul down. At 9:43am today, the backhaul serving Broadlink customers in North Santa Rosa stopped passing traffic. Broadlink technicians are working on restoring service at this time. No other services should be affected, and we will update this status message as soon as possible. — John F
Apache patched.
Tue Oct 26 16:19:59 PDT 2004 — Apache patched. Apache on our new web cluster is now patched to guard against a recently released exploit. No downtime was incurred with the upgrade, and the exploit wasn’t used against us. -Scott and Augie
Route flap.
Tue Oct 26 14:24:43 PDT 2004 — Route flap. At 1:40pm, we had a layer 2 link to San Jose Equinix flap, causing momentary unreachability to some Internet destinations. -Scott and John
Network Maintenance.
Mon Oct 25 17:28:00 PDT 2004 — Network Maintenance. Tonight at midnight we will be replacing some forwarding hardware on one of our new routers in San Francisco. While the upgrade is being conducted, there will be degraded performance and potentially a few short service interruptions as routing protocols converge. -Kelsey, Nathan and John
Update: We have confirmed that our previous issues were due to a bad forwarding card. After a quick replacement, everything looks great. The replacement itself required moving traffic off the router, which resulted in around 45 seconds of degraded service to a subset of Internet destinations. This completes the first phase of Sonic.net’s new network build-out — as of tonight, we have over 1.2 Gbps of capacity out of Santa Rosa! -Nathan and John
UFO/UUNET route flap caused temporary instability.
Sun Oct 24 10:12:23 PDT 2004 — UFO/UUNET route flap caused temporary instability. This morning, from about 8:15 AM to 8:30 AM, our route to UFO/UUNET flapped a few times, causing connectivity problems to many sites. -Kavan
DSL Maintenance.
Thu Oct 21 14:32:55 PDT 2004 — DSL Maintenance. On Saturday morning at 1am, we will be performing maintenance on DSL equipment located in San Francisco. DSL customers served out of San Francisco will experience a brief interruption in service while the DSL router reboots. -Operations
SpamAssassin Hiccup.
Wed Oct 20 11:37:14 PDT 2004 — SpamAssassin Hiccup. Starting at approximately 2:00AM this morning, two of our SpamAssassin servers acted up, causing some mail to pass through unfiltered. We resolved the issue as soon as it was brought to our attention. -Kelsey
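For those curious how mail can slip through unfiltered when the scanners misbehave: the behavior above is consistent with a fail-open design, where a message is delivered unscanned if the scanner can’t be reached or doesn’t answer in time, rather than being delayed or bounced. Here’s a minimal sketch of that pattern (the scanner hostname is a placeholder and the check itself is stubbed out; this is not our actual mail configuration):

    # Fail-open filtering sketch: if the scanner is down or slow, deliver
    # the message unfiltered instead of delaying or dropping it.
    import socket

    SPAMD_HOST = "spamd.example.net"   # hypothetical scanner host
    SPAMD_PORT = 783                   # default spamd port
    SCAN_TIMEOUT = 10                  # seconds before we give up

    def message_is_spam(message: bytes) -> bool:
        """Placeholder check: only verifies the scanner is reachable.
        A real check would submit the message and parse the verdict."""
        with socket.create_connection((SPAMD_HOST, SPAMD_PORT),
                                      timeout=SCAN_TIMEOUT):
            return False

    def deliver(message: bytes) -> str:
        try:
            spam = message_is_spam(message)
        except OSError:
            # Scanner unavailable: fail open and pass the mail through.
            return "delivered unfiltered"
        return "tagged as spam" if spam else "delivered clean"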
(Updated) Power outage in Santa Rosa.
Tue Oct 19 08:17:40 PDT 2004 — (Updated) Power outage in Santa Rosa. During the storm, we suffered a brief power outage, which triggered a failure in our Liebert UPS battery plant. This resulted in downtime for most of our services, beginning at 5:57AM, and lasting almost an hour. Email took a bit longer to bring online, and is available now – no email was lost, and any pending deliveries were queued.
Currently we’re running with the UPS bypassed and with our diesel generator running on standby in case of another utility outage. We will be working with Liebert service to determine why the UPS failed – our experience with the Liebert has been very, very disappointing, and while I normally hesitate to name a vendor here in the MOTD, I’m making an exception in this case.
Update: We’re back on UPS, after finding and eliminating a bad battery cell. We also found three other batteries which were dropping below the ideal voltage when a load was applied. All of these batteries were recently tested, so we’re investigating why we’ve got such a large number of battery failures.
Subsequent to this outage, we’re working up a list of lessons learned, and items for our to-do list, to avoid the impact of this type of failure in the future. Here are a few of the items we’re beginning work on now:
- Deploy additional battery cabinet in parallel – cost, around $17,000.
- Coordinate with Liebert and test to assure that system is RELIABLE!
- Schedule and execute full live system tests during 3AM maint window.
- Increase descriptiveness of text paging, and the number of people paged.
- Add UPS contact monitoring via Honeywell system and manned call center (a rough sketch follows this list).
- Deploy RADIUS and DNS servers in San Francisco datacenter for redundancy.
- Test network routing absent Santa Rosa datacenter to assure reachability.
- Deploy an additional UPS here for systems and routers with dual inputs.
- Add UPS to voicemail system, and increase runtime on phone system UPS.
- Drill process for customer notification and internal response.
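As a rough illustration of the paging and UPS-monitoring items above (the contact check, recipients, and paging transport are all hypothetical placeholders, not a description of our Honeywell or paging setup), a monitor along these lines would watch the UPS status and send a descriptive page to several people when the UPS goes on battery:

    # Hypothetical sketch: watch a UPS status contact and page several
    # people with a descriptive message when utility power is lost.
    import time

    RECIPIENTS = ["oncall-1@example.net", "oncall-2@example.net"]  # placeholders
    POLL_INTERVAL = 30  # seconds between contact checks

    def ups_on_battery() -> bool:
        """Placeholder: read the UPS dry-contact or SNMP status here."""
        return False

    def send_page(recipient: str, text: str) -> None:
        """Placeholder: hand the message off to the paging gateway."""
        print(f"PAGE {recipient}: {text}")

    def watch() -> None:
        was_on_battery = False
        while True:
            on_battery = ups_on_battery()
            if on_battery and not was_on_battery:
                # Descriptive message rather than a bare alarm code.
                msg = "UPS on battery at Santa Rosa datacenter; utility power lost"
                for person in RECIPIENTS:
                    send_page(person, msg)
            was_on_battery = on_battery
            time.sleep(POLL_INTERVAL)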
We are committed to learning from this failure, and to making changes to assure that this type of outage doesn’t occur again. I’m very sorry about the downtime. -Dane