Sonic Status March

The Redback router we use to serve Pacific…

March 30, 2001

Fri Mar 30 22:07:02 PST 2001 — The Redback router we use to serve Pacific Bell ADSL and Broadlink WDSL customers had to be rebooted this evening because it seemed to be behaving inconsistently. As it turned out, the Redback was not at fault. Total downtime for customers was three minutes or less. – Eli

CW and UUNet T3s.

March 29, 2001

Thu Mar 29 12:12:51 PST 2001 — CW and UUNet T3s. In addition to the recent bandwidth increase on our UUNet T3, we have increased the available bandwidth on our Cable & Wireless T3. We ordered these increases to meet customer demand for more bandwidth; additionally, Usenet feeds are consuming quite a bit of bandwidth despite our satellite feed.

The entire “bandwidth equation” is under review here at Sonic.net. We have initiated a bandwidth profiling project, and are exploring new ways to bring content to Sonic.net members without using the expensive Internet backbone links. -The Sonic Operations Team

Mistral upgraded.

March 29, 2001

Thu Mar 29 11:48:56 PST 2001 — Mistral upgraded. Yesterday, we upgraded Mistral, one of the web servers in our cluster, to a dual PIII-750 with half a gigabyte of RAM. Mistral used to be a single PIII-550 with a quarter gigabyte of RAM. (Thanks, Kelsey.) Our web server cluster serves over 5.1 million hits a day, and during peak periods hits per second can exceed 115. We have these statistics, and more, available as graphs; Nathan and Kelsey are working to make those graphs public within the next week. -The Sonic Operations Team
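To put the two traffic figures above side by side, here is a minimal sketch that converts the daily hit count into an average per-second rate and compares it with the quoted peak. The function names are hypothetical and chosen only for illustration. The inputs are the round numbers from the post (5.1 million hits a day, 115 hits per second at peak), so the result is a rough estimate: an average of roughly 59 hits per second, with the peak a little under twice that.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_hits_per_second(hits_per_day: float) -> float:
    """Average request rate implied by a daily hit count."""
    return hits_per_day / SECONDS_PER_DAY


def peak_to_average_ratio(hits_per_day: float, peak_hits_per_second: float) -> float:
    """How many times higher the peak rate is than the daily average."""
    return peak_hits_per_second / average_hits_per_second(hits_per_day)


if __name__ == "__main__":
    # Figures quoted in the post; both are stated as lower bounds ("over", "can exceed").
    hits_per_day = 5_100_000
    peak = 115

    avg = average_hits_per_second(hits_per_day)
    ratio = peak_to_average_ratio(hits_per_day, peak)
    print(f"average: {avg:.1f} hits/sec, peak is {ratio:.2f}x the average")
```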

bolt.sonic.net, our Linux shell server…

March 28, 2001

Wed Mar 28 10:02:42 PST 2001 — bolt.sonic.net, our Linux shell server, stopped talking to our mail server last night. As a result, some e-mail sent via the shell was delayed. This did not affect dialup users’ e-mail, only e-mail that was sent via bolt. -Steve, Nathan

Ape, our core Extreme Networks switch, failed…

March 22, 2001

Thu Mar 22 21:51:58 PST 2001 — Ape, our core Extreme Networks switch, failed tonight at 20:26:47. Its snmpd process got stuck in a software loop and maxed out the management modules’ CPU and routing engine. We rebooted the switch just after 21:00, which cleared up the problem and restored all services. While the switch was in failure mode, its routing engine was unresponsive, which primarily affected our colocated customers. It may also have sporadically failed to forward normal switch traffic during this period, which would have resulted in network reachability errors.

We identified the trigger for the bug in the switch’s software, stopped probing that SNMP OID, and have submitted a bug report to Extreme Networks. -Kelsey, Scott and Eli

Last night one PRI in the 9811 dialup group…

March 20, 2001

Tue Mar 20 14:40:30 PST 2001 — Last night one PRI in the 9811 dialup group went down due to an OC3 circuit failure at Focal Communications. Focal was able to work with Pacific Bell and Sprint to get the PRI back online by 8:00pm. During the outage, the 9811 modem pool may have returned busy signals for a short period. We have added more capacity to this and the Sebastopol dialup groups to increase reliability in those groups. -Steve

PG&E rolling outages.

March 20, 2001

Tue Mar 20 14:25:06 PST 2001 — PG&E rolling outages. Sonic.net’s facility has been reclassified into a zone eligible for rolling blackouts. While we were previously in a “50” zone, we’ve been moved into zone 2 as part of PG&E’s efforts to address their energy shortfalls. In response, today we deployed additional battery capacity at a cost of around $12,000, and we have a diesel generator on site to keep us going beyond the battery runtime. -Dane, Eli, Kelsey, Russ and Scott

Unexpected Network Outage.

March 16, 2001

Fri Mar 16 16:56:35 PST 2001 — Unexpected Network Outage. We have been working with Extreme Networks to troubleshoot and identify some problems that we’ve had with VLANs in their switches not behaving as expected under certain circumstances. We were gathering a packet dump to help their engineers resolve the problem when we unexpectedly caused a switch loop between giga and ape (our two core switches). We were able to resolve the loop in a matter of seconds, but our Alteon load balancing switch decided to down its interfaces into ape, making all web and email servers unavailable. It took a few minutes to coax the Alteon to bring its interfaces back up and then a few minutes more for the server network to converge.

All locally served web sites and email services were offline for five minutes. No other services were affected. -Russ, Nathan and Kelsey

DSLAM1 in Santa Rosa has had intermittent…

March 16, 2001

Fri Mar 16 16:12:25 PST 2001 — DSLAM1 in Santa Rosa has had intermittent issues over the last few days, but has been operational for the most part. This afternoon, we noticed more significant problems, and we will be working with Pacific Bell’s Repair to determine when the DSLAM will be fully operational. Our DSL product manager will be tracking the issue all weekend and coordinating with Support to get news out in a timely fashion. – Sonic Support

DSLAM 1 in the main Santa Rosa CO is down.

March 12, 2001

Mon Mar 12 17:55:55 PST 2001 — DSLAM 1 in the main Santa Rosa CO is down. This DSLAM serves approximately 200 of our ADSL customers in Santa Rosa. PacBell (ASI) has not provided us with an ETR. Don’t forget that you may still dial up using a modem if your DSL is offline. -Dave & Eli