Some Sonic fiber customers experienced an internet outage on the evening of Wednesday, April 8th. Below is the Reason For Outage (RFO) write-up, an internal process document which we are sharing to provide some insight into this unprecedented event.
On April 8th, 2020 at 6:44 PM PDT, one of the large core routers in the Sonic network failed. We have redundant equipment that is configured to take over the load immediately. Unfortunately, due to the way the router failed, the failover to our redundant router was slower than expected and took a bit over forty seconds. During this time, roughly half of inbound and a significant majority of outbound internet traffic was stalled. Most data traffic in Northern California was affected, though data traffic on IPBB and in Southern California was not impacted. Voice services network-wide were also briefly impaired.
Once traffic rerouted around the failed core router, we noticed traffic levels were much lower than expected. We were missing many tens of Gbps of traffic on transit and peering sources combined. Our Network Engineering team began investigating immediately.
After routing stabilized, we began receiving reports via Sonic’s support team that some Fusion Fiber customers were offline. After verifying examples, we noticed that the affected customer premises equipment was stuck in a DHCPDISCOVER state, unable to obtain an IP address. Rebooting the home router and the ONT (Optical Network Terminal, the device that terminates the fiber optic connection at the customer premises) did not help the equipment obtain an IP address or restore service.
What we found was that the brief interruption in internet access caused some customer premises devices to re-send large quantities of DHCP messages, which overwhelmed their associated customer aggregation routers across our network. As a result, those aggregation routers became overloaded and unable to hand out new IP addresses. Customers that had an IP address before the outage and did not release the address during the minute-long outage window were not impacted.
Throughout the evening, the Network Engineering team worked to isolate the cause and mitigate the issue. We were concerned that any heavy-handed approaches would cause a larger outage for the other customers that were not affected. We found a potential fix for this by restarting an internal process on the aggregation routers to clear out DHCP queues that were full due to the flood of requests. However, while this resolved the issue for customers in some regions, it wasn’t enough for some heavily-loaded devices.
To fully resolve the issue the Network Engineering team then deployed a configuration change to severely bottleneck the inbound DHCPDISCOVER messages. That slowed the flood enough for even the most populated aggregation routers to slowly start handing out IP addresses to customer home routers without being overwhelmed. The last few users were brought back up shortly after midnight. Coincidentally, this is also around the time when we were able to replace the trigger for all of this – the core router mentioned earlier.
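The effect of throttling inbound DHCPDISCOVER messages can be illustrated with a token bucket, a common rate-limiting technique. This is only a minimal sketch: the write-up does not say which mechanism or limits were configured on the aggregation routers, so the `TokenBucket` class, the `simulate` helper, and every rate used with them are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Admit at most `rate` messages per second, with bursts up to `burst`.

    Illustrative only: not the mechanism or the values used in the incident.
    """
    rate: float         # tokens added per second (hypothetical value)
    burst: float        # maximum number of tokens the bucket can hold
    tokens: float = 0.0
    last: float = 0.0   # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at `burst`.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True    # pass this DHCPDISCOVER on to the DHCP process
        return False       # drop it; the client is expected to retry later


def simulate(arrivals_per_sec: int, seconds: int, bucket: TokenBucket) -> tuple[int, int]:
    """Feed evenly spaced arrivals through the bucket; return (admitted, dropped)."""
    admitted = dropped = 0
    for i in range(arrivals_per_sec * seconds):
        now = i / arrivals_per_sec
        if bucket.allow(now):
            admitted += 1
        else:
            dropped += 1
    return admitted, dropped


if __name__ == "__main__":
    # Hypothetical flood: 1000 DISCOVERs per second against a 50 per second limit.
    bucket = TokenBucket(rate=50, burst=50, tokens=50)
    print(simulate(arrivals_per_sec=1000, seconds=10, bucket=bucket))
```

With a limit like this, the DHCP process sees a steady trickle it can answer instead of the whole flood, at the cost of some clients waiting longer for an address, which matches the slow, staged recovery described above.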
We have since reached out to the vendor of the customer premises equipment affected in this outage. Their devices detect internet outages within 30-45 seconds and will then re-request DHCP. This is close to the 40 seconds that the aforementioned failover took, and is the likely reason so many devices in our network started to re-DHCP, triggering the congestion that prevented customer routers from coming back online.
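To see why a 40-second failover sits so awkwardly inside a 30-45 second detection window, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that detection thresholds are spread uniformly across the window; the vendor did not describe the actual distribution, and `fraction_triggered` is a hypothetical helper.

```python
def fraction_triggered(outage_s: float, min_detect_s: float, max_detect_s: float) -> float:
    """Estimate the share of devices that would declare an outage and re-DHCP.

    Assumes (hypothetically) that each device's detection threshold is drawn
    uniformly from [min_detect_s, max_detect_s].
    """
    if outage_s <= min_detect_s:
        return 0.0   # outage ends before any device notices
    if outage_s >= max_detect_s:
        return 1.0   # every device notices
    return (outage_s - min_detect_s) / (max_detect_s - min_detect_s)


if __name__ == "__main__":
    for outage in (20, 40, 50):
        share = fraction_triggered(outage, 30, 45)
        print(f"{outage}s outage: {share:.0%} of devices re-DHCP")
```

Under that assumption, a failover of under 30 seconds would trigger no devices at all, while one of about 40 seconds would trigger a majority of them, which is why reducing the core router failover time matters so much.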
Moving forward, we are now finalizing fixes to our aggregation routers to reduce the impact of similar events. We’re also investigating the reason for the long core router fail-over time to reduce the impact of any similar issues. Finally, we are having an ongoing post-mortem discussion with our vendor architects and engineers on this incident and will be seeking ways to prevent a similar issue from happening again.
Update: 11:59pm We believe all services have been restored. If you are still impacted or otherwise experiencing service issues please contact our technical support team tomorrow morning.
Update: 11:30pm We have begun to restore regions in a staged fashion as we carefully monitor progress.
Update: 10:54pm We continue working on resolving the issue.
Update: 10:24pm We continue to work on resolving the issue.
Update: 9:54pm We are still working on resolving the issue.
Update: 8:53pm Our network operations team is looking into the DHCP issues (work started around 7pm tonight). There is no new information at this time. We will post updates as soon as we have more. Thank you for your patience!
As of 7:00pm we are experiencing network issues that are causing DHCP problems. As a result, affected customers are unable to access the internet. Currently we are only seeing this issue on our Sonic Fiber services. We thank you for your patience as we work to resolve this.