–START RFO–
Some Sonic fiber customers experienced an internet outage on the evening of Wednesday, April 8th. Below is the Reason For Outage (RFO) write-up, an internal process document which we are sharing to provide some insight into this unprecedented event.
On April 8th, 2020 at 6:44 PM PDT, one of the large core routers in the Sonic network failed. We have redundant equipment that is configured to take over the load immediately. Unfortunately, due to the way the router failed, the failover to our redundant router was slower than expected and took a bit over forty seconds. During this time, roughly half of inbound internet traffic and a significant majority of outbound internet traffic was stalled. Most data traffic in Northern California was affected, though data traffic on IPBB and in Southern California was not impacted. Voice services network-wide were also briefly impaired.
Once traffic rerouted around the failed core router, we noticed traffic levels were much lower than expected. We were missing many tens of Gbps of traffic on transit and peering sources combined. Our Network Engineering team began investigating immediately.
After routing stabilized, we began receiving reports via Sonic’s support team of some Fusion Fiber customers being offline. After verifying examples, we noticed that the affected customer premises equipment was stuck in a DHCPDISCOVER state, unable to obtain an IP address. Rebooting the home router and ONT (Optical Network Terminal – the device that terminates the fiber optic connection at the customer premises) did not help the equipment obtain an IP address or restore service.
What we found was that the brief interruption in internet access caused some customer premises devices to re-send large quantities of DHCP messages, which overwhelmed their associated customer aggregation routers across our network. As a result, those aggregation routers became overloaded and unable to hand out new IP addresses. Customers who had an IP address before the outage and did not release the address during the minute-long outage window were not impacted.
Throughout the evening, the Network Engineering team worked to isolate the cause and mitigate the issue. We were concerned that any heavy-handed approach would cause a larger outage for the customers who were not affected. We found a potential fix: restarting an internal process on the aggregation routers to clear out DHCP queues that were full due to the flood of requests. However, while this resolved the issue for customers in some regions, it wasn’t enough for some heavily loaded devices.
To fully resolve the issue, the Network Engineering team then deployed a configuration change to severely throttle the inbound DHCPDISCOVER messages. That slowed the flood enough for even the most populated aggregation routers to gradually start handing out IP addresses to customer home routers without being overwhelmed. The last few users were brought back up shortly after midnight. Coincidentally, this is also around the time when we were able to replace the trigger for all of this – the core router mentioned earlier.
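To illustrate the general idea of throttling inbound DHCPDISCOVER messages, here is a minimal token-bucket sketch in Python. It is not the configuration change we deployed on the aggregation routers; the class name, the rate of 50 messages per second, and the arrival rate of 5,000 per second are all hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Admit roughly `rate` messages per second, with bursts up to `burst`."""
    rate: float          # tokens added per second (hypothetical value)
    burst: float         # maximum number of tokens the bucket can hold
    tokens: float = 0.0  # tokens currently available
    last: float = 0.0    # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # message is dropped; the client will retry later


def simulate(arrivals_per_sec: int, seconds: int, bucket: TokenBucket) -> int:
    """Return how many DISCOVER messages pass the limiter in a steady flood."""
    admitted = 0
    for i in range(arrivals_per_sec * seconds):
        now = i / arrivals_per_sec  # evenly spaced arrival times
        if bucket.allow(now):
            admitted += 1
    return admitted


if __name__ == "__main__":
    bucket = TokenBucket(rate=50.0, burst=50.0, tokens=50.0)
    offered = 5000 * 10
    admitted = simulate(5000, 10, bucket)
    print(f"offered={offered} admitted={admitted}")
```

The point of the sketch is that the DHCP process only ever sees a bounded request rate, so it can work through its queue even while the flood continues. Dropped clients simply retry and are served later.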
We have since reached out to the vendor of the customer premises equipment affected in this outage. Their devices detect internet outages within 30-45 seconds and will then re-request DHCP. This is close to the 40 seconds that the aforementioned failover took, and is the likely reason why so many devices in our network started to re-DHCP, triggering the congestion that prevented customer routers from coming back online.
Moving forward, we are now finalizing fixes to our aggregation routers to reduce the impact of similar events. We’re also investigating the reason for the long core router failover time to reduce the impact of any similar issues. Finally, we are holding an ongoing post-mortem discussion with our vendor’s architects and engineers on this incident and will be seeking ways to prevent a similar issue from happening again.
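The timing overlap can be made concrete with a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each device's outage-detection threshold is spread uniformly across the 30-45 second window; the vendor has not told us the actual distribution, and the function name is hypothetical.

```python
def fraction_rediscovering(outage_s: float,
                           detect_min_s: float = 30.0,
                           detect_max_s: float = 45.0) -> float:
    """Estimate the share of devices that re-request DHCP during an outage.

    Assumes detection thresholds are uniformly distributed between
    detect_min_s and detect_max_s (an illustrative assumption only).
    """
    if detect_max_s <= detect_min_s:
        raise ValueError("detect_max_s must be greater than detect_min_s")
    share = (outage_s - detect_min_s) / (detect_max_s - detect_min_s)
    return max(0.0, min(1.0, share))  # clamp to the range [0, 1]


if __name__ == "__main__":
    for outage in (10.0, 30.0, 40.0, 60.0):
        print(f"{outage:>5.0f} s outage -> {fraction_rediscovering(outage):.0%}")
```

Under that assumption, an interruption of about 40 seconds would push roughly two thirds of these devices into re-requesting DHCP at nearly the same time, while an interruption shorter than 30 seconds would trigger none of them.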
–END RFO–
Update: 11:59pm We believe all services have been restored. If you are still impacted or otherwise experiencing service issues please contact our technical support team tomorrow morning.
Update: 11:30pm We have begun to restore regions in a staged fashion as we carefully monitor progress.
Update: 10:54pm We continue working on resolving the issue.
Update: 10:24pm We continue to work on resolving the issue.
Update: 9:54pm We are still working on resolving the issue.
Update: 8:53pm We currently have our network operations team looking into the DHCP issues (we started around 7pm tonight). As of right now there is no new information. We will post updates as soon as new information is given to us. Thank you for your patience!
As of 7:00pm we are experiencing an issue with our network that is causing DHCP problems. As a result, some customers are unable to access the internet. Currently we are only seeing this issue on our Sonic Fiber services. We thank you for your patience as we work to resolve this.
We’ll be updating this post, as well as @sonic and @sonic_status on Twitter!
We don’t have a timeline for repair just yet, but we’re trying to get service restored ASAP.
We appreciate the support, Christopher! These are trying times, but we’ll get through it together!
Pretty sure I reached download speeds over 400 Mb/s the other day. Y’all take your time, and thanks for working late to fix this issue.
Hi Patrick! There’s currently an outage impacting connectivity in your area. Our team is doing everything they can to get service restored, so hopefully we’ll get you back up and running shortly.
Honestly, you’re absolutely right. We’re working on it though, and hope to get service restored soon.
I echo Christopher Washington. Stay safe and thanks for your hard work getting us back up and running.
wow Amy is harsh! What a time to be picking on people’s typing and talking sh*T. Clearly she has some connectivity available to her.
We hear you, Chris. This is an all hands on deck outage, and we understand the importance of communication with so many people relying on connectivity. That said, we simply don’t have the information available to produce worthwhile updates. We appreciate the feedback and we’re working to improve this flow of information. Thanks for your patience, and hopefully we’ll get you reconnected shortly.
@danielthompson, thanks for putting this thread up. Do you have a rough idea when it will be up: tonight, tomorrow morning, in a couple of days, etc.?
Sorry about that, John! We understand the situation is frustrating, but we’re confident we’ll have you up and running soon.
Thanks for all the replies, Daniel, and to your team for the after-hours work! keep it up!
Where the *#$&? is my fiber on Chenery St and 30th SF? This DSL is too slow 🙂
We will all be fine and up and running soon, I’m sure! No one wants the power to be out, Sonic included. I’m sure they are working as hard as they can on it and it’ll get fixed when it does. For now… we can go old school
I have some important stuff to upload in 5 hours. Do you think it will be back up by 3:00AM? If not, I need to find a different solution.
Thank you guys for working hard to fix this. Us technical people know what a challenge this can be, good luck to you!
My next-door neighbor, whose wifi I’m now piggybacking on, could not believe that his Sonic internet was working while mine was not. It’s interesting. When someone there has a minute, do please explain why that is.
Sonic is an absolutely top rate service. I know that with this outage, there are many Sonic employees working late tonight. I’m grateful for everyone who works to give us internet service.
This is our first difficulty with Sonic: you have been providing amazing service! Really feeling how important internet connection is right now so very grateful for your efforts to save us
I lost my ladder game.. went down 20 points and now am in Diamond.. are you gonna reimburse me those points?! Lol just kidding. I know you guys will have it up soon. Much love
We hope it will be fixed before then, but a contingency plan is never a bad idea.
Ok, thank you so much for all the answers and hard work from your team.
Thanks for the updates! Good luck.
Thanks Susan, bear with us and we’ll have you up and running soon!
We really appreciate the support, Raymond!
Ouch, sorry about that, Julian. We’ll get you back up soon so you can hopefully get back to your rightful rank 😀
When is soon though?
I don’t have a clear-cut answer for you at the moment, Samuel. It appears as though there are other customers in the same situation as you: in the same residence, or in close proximity, one can surf while the other can’t. The thing with networks is that connectivity doesn’t necessarily depend on geographic location – this is likely less an issue with your physical area than with the area your service occupies in our network. Once we have resolved the issue, we’ll provide a post-mortem if possible, which should explain the behavior you’ve described.
I’m amazed that the first trouble of any kind I’ve had on our home internet is a network wide outage. this is the sort of admin nightmare I’ve only dreamt of, I’m super impressed to see y’all taking it so seriously! good luck, and *man* am I looking forward to whatever y’all can say publicly about the story, this sounds like such a crazy situation – the entire bay down because of dhcp misconfiguration sounds like it has to have a really interesting backstory.
best of luck, may your backup connections be reliable :p
also, like, no biggie, but hurry up, I wanna play Minecraft xD
We hear you, Kevin. We’re doing what we can to restore service, and hope to get you connected ASAP.
I understand the frustration, Greg. We rely on the internet as much as you do, and we’re trying to get it back up and running as quickly as we can.
Thanks for constant updates – appreciate the responsiveness. Sonic has been great so far and this is the first ever issue we have faced. Good luck to your team
“Soon” is as soon as we’re able to address the underlying issue and get you reconnected. Any timeline provided would be disingenuous at this moment, but we’re working on it. Thank you for your patience, Michel.
Thanks Sharat!
I reset my router thinking I was doing something wrong. Hopefully it will be easy to connect again once it’s running. Hopefully this gets resolved but it’s not the end of the world for me. I can have patience though.
This is the first issue we’ve had with Sonic, so we figure they deserve a little forbearance.
Turns out we still have DVDs and Blu-ray disks to play. CDs for music, and even, GASP!, books.
I think we may survive.
I do appreciate the updates though. And we would love to have connectivity back by the morning so my partner can teach her classes at CCSF.
Thanks for the updates, appreciate that your team is working through the issue.
Stay safe and thank you for the frequent updates.
PS: Phil, I’ve clocked over 950 up AND down with a gaming laptop plugged directly into the fiber modem…
Wow! Here’s looking to getting you back up and running at those speeds as soon as we can. Thanks for the support, Luther!
Sonic, thanks for the 10:24pm update. Please provide the next update at 11:00pm,
Thank you,
Lorraine
It’s back up! Thanks Daniel and team for working late and helping us fix this! Great customer service!
Thank you! We have Internet service again.
Thanks for all the hard work!
Thanks for the hard work guys/gals. I know the pressure is on to keep things running stably with increased capacity, and operating in this crisis isn’t easy.
We need the internet now more than ever.
Service should be restored now!
Minecraft awaits 🙂
I can’t imagine how much work went into getting our service back. Thank you. This post and its comments were deeply reassuring, simply because it provided some regular reconfirmation that 1) it wasn’t just me, cocooned in social isolation, who’d fallen off the grid, and 2) people were working to reconnect all of us. Before I found the post I spent a couple of hours waiting for a response to my request for customer support via text, which never came. I understand that there wasn’t much they could have told me. But can I respectfully suggest that in the event of a future outage you send periodic updates via text, just so that people know they haven’t been forgotten?
Thank you for addressing the issue so swiftly! We have been so impressed and satisfied as Sonic customers, and in a situation like this you have only delivered greater confidence. Thanks to the entire Sonic team.
Thank you so much for that positive feedback, and I am glad that the service is back up and working properly for you. If you do have any further issues or concerns around your services, please reach out to us (707)547-3400 so we can help you the best we can.
Hey all. I see a lot of these reports are from 4/8. Luckily I didn’t experience any outage yesterday. However, today is 4/9 and it looks like it is my turn. Just want to report the issue in 94122. I’m sure you are all working on it. Must be a huge load on the system with all the working from home. I’m unable to work as a result but I’m just grateful I still have a job. I know you’re doing yours. Thanks!