Catastrophic core switch failure.

Tue Apr 17 16:34:19 PDT 2001: During routine maintenance, our core Extreme Networks Black Diamond 6800 switch failed. This $120,000 piece of equipment transports almost all of our network traffic, and without it, we’re totally dead in the water. Its redundant core management switch module did not successfully take over for some reason, and we’ll be meeting with Extreme to ask them to explain exactly what that $22,000.00 investment was worth to us.

Downtime began at 3:28pm and lasted 47 tense minutes. During this time, all network services were unavailable. Seven operations team members frantically dove into the guts of the switch, and in the end, a factory default boot with minimal configuration was used to bring over a recently stored config from our main admin server. This config was brought online and the rest of the network was booted. Meanwhile, back office staff pitched in with technical support, and hold times were kept under a minute.

We apologize for the 47-minute service interruption, and we’ll be doing a post-mortem shortly to determine what changes we can make to prevent this from ever happening again. We’ll be hauling Extreme in to answer for their equipment. I will post an update myself here when we’ve taken final steps to ensure that this can’t happen again.

Sonic.net has made large investments in network redundancy, but it’s been difficult for us to isolate all potential failures. Our operations group will work hard to ensure that we nail them all down, and we will “fire drill” our network with simulated failures in order to prove to ourselves that it will not break.

Thank you for your understanding and patience. -Dane, Scott, Kelsey, Eli, Nathan, Scooter, Russ, Steve, Chris, and the entire tech staff.
