Widespread network outage.

Fri Jan 4 19:53:00 PST 2008: Widespread network outage. At approximately 5pm today we logged a massive amount of inbound traffic headed toward one of the colocation customers in our Santa Rosa datacenter. This distributed denial of service (DDoS) attack consisted of well over a gigabit per second of traffic aimed at this customer, sourced from thousands of zombie computers that are likely part of a large botnet. The attack caused two of our gigabit transit links to flap wildly, which in turn caused routing instability inside and outside of our network. We curtailed the flapping with a controlled shutdown and bring-up of these transit links. During the attack, most traffic continued to flow normally, but connectivity to some sites was significantly degraded or unavailable.

Further complicating matters was the rather confusing loss of a Santa Rosa datacenter router. In the middle of the DDoS, one of the two core routers that service our Santa Rosa datacenter suffered a hard drive failure. In addition to adding a red herring to the mess, this router seems to have advertised some incorrect routing information during the confusion, which further complicated our restoration. At this time the router is still down pending hardware replacement. We’ve got on-site spares for this unit, and will be swapping them in around midnight tonight during a maintenance window. There are no customers directly connected to this router, and it’s set up with a redundant neighbor that can take over its duties as necessary. No customers are affected by this router being off-line.

As if that wasn’t enough, one of our network engineers made an unfortunate typo in the heat of battle, the end result of which was a nearly network-wide loss of routing protocol packets. This occurred at around 6:20pm, after internet-wide connectivity was almost fully restored. We set emergency roll-back procedures in motion, and rapid service restoration required using our out-of-band management system to reach the consoles of the affected devices remotely and deactivate the change. Even with these procedures, fully restoring network connectivity took around 25 minutes.

We’ll be discussing this outage at length internally to put policies and procedures in place to prevent a recurrence, and we’ll be investigating why the routing instability had such an impact on our network core. Our apologies for the downtime!

-Nathan, Jared, Matt, and the Sonic.net NOC
