Wed Apr 25 09:32:11 PDT 2001 — buzz.sonic.net, one of our mail servers, had its internal clock set incorrectly. You may have received some e-mail dated 3/13/1997. The clock has been reset and is working correctly now. Sorry for any problems this may have caused. -Steve
Month: April 2001
Night Operations Complete.
Wed Apr 25 01:44:46 PDT 2001 — Night Operations Complete. We determined that the secondary MSM in the Extreme switch appears to have failed and may have been the source of the unexpected downtime. We also reworked the switch's configuration to make sure it is free of errors and will function properly. While the switch was being worked on, our entire network was down; the blackout lasted about 20 minutes. It should be noted that once we have migrated to our dual-switch redundant core network, outages like this will have minimal effect on our network. -Scott and Kelsey
Night Operations: At approximately 1:00 AM…
Tue Apr 24 10:33:03 PDT 2001 — Night Operations: At approximately 1:00 AM Wednesday morning we will be taking Ape, the switch that failed last week, offline to finish restoring all of its functionality. The total downtime should be around 10 minutes, but could be longer if we run into trouble. -Kelsey and Scott
Our Redback DSL router locked up and we are…
Tue Apr 24 14:24:57 PDT 2001 — Our Redback DSL router locked up and we are in the process of rebooting it now; this will mean about 5 to 10 minutes of downtime for our DSL customers. -Steve and Kelsey
Update: The SMS crashed during normal operational procedures. We were able to capture the ‘core dump’ and will be sending it off to RedBack Networks for analysis. This outage affected all of our PacBell DSL, Broadlink DSL and FRATM customers. Total downtime didn’t exceed 10 minutes. If your circuit is still down after a few minutes, reboot your CPE and, if that doesn’t resolve the problem, give support a call. -Kelsey
We’ve been seeing what appears to be denial…
Fri Apr 20 16:49:56 PDT 2001 — We’ve been seeing what appear to be denial of service (DoS) attacks in our statistics today. This has caused sluggish performance for a few brief intervals. We’re applying filters to our outbound links to prevent us from becoming a source of spoofed-IP attacks, and if we see additional traffic, we’ll try to isolate the source and nail this down. -Dane, Nathan, Scott and Kelsey
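For those curious about the mechanics, the outbound filters are simple source-address (egress) rules: traffic is only allowed to leave our network if it claims a source address that actually belongs to us. In Cisco-style ACL syntax the idea looks roughly like this (the prefix, access-list number and interface name are illustrative examples, not our actual configuration):

    ! permit traffic whose source address falls inside our own address space
    access-list 120 permit ip 208.201.224.0 0.0.0.255 any
    ! drop and log anything claiming a source address we don't own
    access-list 120 deny ip any any log
    !
    interface Serial0
     ip access-group 120 out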
Public MySQL Server: A kernel that we…
Wed Apr 18 17:42:29 PDT 2001 — Public MySQL Server: A kernel that we installed a few nights ago showed some signs of instability, so we took the box down to replace it with the old, stable kernel. Just to be safe, we also verified the integrity of all of the SQL databases on it, which delayed it coming back online. It was offline for about 15 minutes. Only a few tools at Sonic depend on this server, twig and the pop finder being two of them. All customer-hosted MySQL databases are also on this server. No data was lost. We will be investigating the problem with the new kernel and, after fixing it, will upgrade to it during the next maintenance window. -Kelsey
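For those wondering how an integrity check like this is done: with the MySQL server stopped, the MyISAM table files can be checked directly on disk. A rough sketch using the standard myisamchk tool (the data directory path is just an example, not necessarily our actual layout):

    # with mysqld stopped, check every MyISAM index file under the data directory
    myisamchk --check /var/lib/mysql/*/*.MYI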
We just experienced an odd set of…
Tue Apr 17 11:51:41 PDT 2001 — We just experienced an odd set of circumstances that caused outbound email from customers to be delayed. If you were trying to send email and found it timing out, please do a send/receive again to flush the messages in your outbox. No email was lost. Downtime for actual transmission of email was about fourteen minutes.
Our four primary mail servers, hosting SMTP (outbound) and POP (inbound) mail, sit behind a load-balancing switch designed to prevent single-point failures from impacting end-users. However, due to the reboot of a secondary nameserver, we found a set of conditions that could trigger a failure. Each mail server uses at least two nameservers, but the primary one on all four mail servers was rns2.sonic.net, 208.201.224.33. When this system was undergoing maintenance, all four mail servers fell back to their secondary for reverse DNS lookups, rns1.sonic.net, 208.201.224.11, but each new connection took an extra 30 seconds to fall back. This caused the Alteon load-balancing switch to mark the mail servers as unresponsive. With all four running slow due to the DNS server being down, the Alteon effectively shut down SMTP services.
To prevent this possibility in the future, the four email servers now use different primary and secondary DNS search orders. We’ve also asked Alteon for a change to the health monitoring: if all of the servers for a particular service are about to be removed from service, the switch should apply more lenient health checks to see whether they are merely running slow.
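To illustrate the resolver change, here is what the standard Unix /etc/resolv.conf files look like with the search orders staggered (the exact two-and-two split shown is an example of the idea, not a copy of the real files):

    # mail servers 1 and 2 -- rns2 first, rns1 second
    nameserver 208.201.224.33
    nameserver 208.201.224.11

    # mail servers 3 and 4 -- rns1 first, rns2 second
    nameserver 208.201.224.11
    nameserver 208.201.224.33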
-Dane, Russ, Kelsey, Scott and Eli
Some web sites were experiencing cgi program…
Tue Apr 17 09:55:16 PDT 2001 — Some web sites were experiencing cgi program issues this morning. thunder.sonic.net, one of our three web servers, had lost its bind to the yp server, which prevented cgi-wrapped programs from executing. After a quick reboot, the bind to the yp server came back up. This only affected cgi-wrapped scripts and lasted a very short period of time. -Steve
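For reference, the yp (NIS) binding on a host can be checked with the standard ypwhich command, and a lost binding can often be recovered by restarting ypbind; a rough example follows (the init script path is typical of Red Hat-style systems and is shown only for illustration):

    # show which yp (NIS) server this host is currently bound to
    ypwhich
    # if the binding has been lost, restarting ypbind usually recovers it
    /etc/rc.d/init.d/ypbind restart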
Catastrophic core switch failure.
Tue Apr 17 16:34:19 PDT 2001 — Catastrophic core switch failure. During routine maintenance, our core Extreme Networks Black Diamond 6800 switch failed. This $120,000 bit of equipment transports almost all network traffic, and without it, we’re totally dead in the water. Its redundant core management switch module did not successfully take over for some reason, and we’ll be meeting with Extreme to ask them to explain exactly what that $22,000 investment was worth to us.
Downtime began at 3:28pm and lasted 47 tense minutes. During this time, all network services were unavailable. Seven operations team members frantically dove into the guts of the switch, and in the end, a factory-default boot with a minimal configuration was used to bring over a recently stored config from our main admin server. This config was brought online and the rest of the network was booted. Meanwhile, back-office staff pitched in with technical support, and hold times were kept under a minute.
We apologize for the 47-minute service interruption, and we’ll be doing a post-mortem shortly to determine what changes we can make to prevent this from ever happening again. We’ll be hauling Extreme in to answer for their equipment. I will post an update here myself when we’ve taken final steps to ensure that this can’t happen again.
Sonic.net has made large investments in network redundancy, but it’s been difficult for us to isolate all potential failures. Our operations group will work hard to assure that we nail them all down, and we will “fire drill” our network with simulated failures in order to prove to ourselves that it will not break.
Thank you for your understanding and patience. -Dane, Scott, Kelsey, Eli, Nathan, Scooter, Russ, Steve, Chris, and the entire tech staff.
Alcatel ADSL Modem vulnerability.
Fri Apr 13 16:52:25 PDT 2001 — Alcatel ADSL Modem vulnerability. While there is currently no known exploit of the Alcatel ADSL Modem vulnerability, we have implemented safeguards to prevent compromise of customer Alcatel ADSL Modems. These include rejecting access to UDP port 7, as well as denying transit of packets with an all-ones source address. Please note that the UDP echo service (port 7) has nothing to do with ping, which uses ICMP echo packets. More information about the vulnerability can be found at the San Diego Supercomputer Center:
security.sdsc.edu/self-help/alcatel/
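For those interested in what these safeguards look like, they are ordinary border filters; in Cisco-style ACL syntax the rules would be roughly as follows (illustrative only, not a copy of our actual router configuration):

    ! block the UDP echo service (port 7) targeted by the vulnerability
    access-list 130 deny udp any any eq 7
    ! drop packets claiming the all-ones source address
    access-list 130 deny ip host 255.255.255.255 any
    ! allow everything else through
    access-list 130 permit ip any any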
-Scott and Dane