[UPDATE : RFO] Known Outage in the Northern California Area.

–START RFO–

Some Sonic fiber customers experienced an internet outage on the evening of Wednesday, April 8th. Below is the Reason For Outage (RFO) write-up, an internal process document which we are sharing to provide some insight into this unprecedented event.

On April 8th, 2020 at 6:44 PM PDT, one of the large core routers in the Sonic network failed. We have redundant equipment that is configured to take over the load immediately. Unfortunately, due to the way the router failed, failover to our redundant router was slower than expected and took a bit over forty seconds. During this time, roughly half of inbound and a significant majority of outbound internet traffic was stalled. Most data traffic in Northern California was affected, though data traffic on IPBB and in Southern California was not impacted. Voice services network-wide were also briefly impaired.

Once traffic rerouted around the failed core router, we noticed traffic levels were much lower than expected. We were missing many tens of Gbps of traffic on transit and peering sources combined. Our Network Engineering team began investigating immediately.

After routing stabilized, we began receiving reports via Sonic’s support team of some Fusion Fiber customers in an offline state. After verifying examples, we noticed that affected customer premises equipment was stuck in a DHCPDISCOVER state, unable to obtain an IP address. Rebooting the home router and ONT (Optical Network Terminal – the device that terminates the fiber optic connection at the customer premises) did not restore service.

What we found was that the brief interruption in internet access caused some customer premises devices to re-send large quantities of DHCP messages, which overwhelmed their associated customer aggregation routers across our network. As a result, those aggregation routers became overloaded and unable to hand out new IP addresses. Customers that had an IP address before the outage and did not release the address during the minute-long outage window were unimpacted.

Throughout the evening, the Network Engineering team worked to isolate the cause and mitigate the issue. We were concerned that any heavy-handed approaches would cause a larger outage for the other customers that were not affected. We found a potential fix for this by restarting an internal process on the aggregation routers to clear out DHCP queues that were full due to the flood of requests. However, while this resolved the issue for customers in some regions, it wasn’t enough for some heavily-loaded devices.

To fully resolve the issue, the Network Engineering team then deployed a configuration change to severely throttle the inbound DHCPDISCOVER messages. That slowed the flood enough for even the most populated aggregation routers to slowly start handing out IP addresses to customer home routers without being overwhelmed. The last few users were brought back up shortly after midnight. Coincidentally, this is also around the time when we were able to replace the trigger for all of this – the core router mentioned earlier.
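For readers curious how throttling of this kind works in principle, a token bucket is one common way to cap the rate of incoming messages. The Python below is an illustrative sketch only: the actual configuration change on the aggregation routers is not described in this write-up, and the class name, function name, rate, and burst values are all hypothetical.

```python
class TokenBucket:
    """Admit at most `rate` messages per second, allowing bursts up to `burst`.

    Hypothetical illustration; not the router vendor's implementation.
    """

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate              # tokens added per second
        self.capacity = float(burst)  # maximum tokens the bucket can hold
        self.tokens = float(burst)    # start with a full bucket
        self.last = 0.0               # timestamp of the previous message

    def allow(self, now: float) -> bool:
        """Return True if a message arriving at time `now` may be processed."""
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: drop this DHCPDISCOVER


def admitted(arrival_times, rate, burst):
    """Count how many messages from a sorted list of arrival times get through."""
    bucket = TokenBucket(rate, burst)
    return sum(1 for t in arrival_times if bucket.allow(t))


# 100 DISCOVER messages arriving at the same instant: only the burst gets through.
print(admitted([0.0] * 100, rate=5.0, burst=10))  # prints 10
```

Dropped messages are not lost for good, because DHCP clients retry. The limiter simply spreads the retries out so the device behind it is offered work at a pace it can sustain.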

We have since reached out to the vendor we use for the customer premises equipment affected in this outage. Their devices detect internet outages within 30-45 seconds and will then re-request DHCP. This is close to the 40 seconds that the aforementioned failover took, and is likely why so many devices in our network started to re-DHCP, triggering the congestion that prevented customer routers from coming back online.
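To make the timing overlap concrete, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that each device's outage-detection threshold is spread uniformly across the 30-45 second window; the real distribution is not stated in this write-up, and the function name is hypothetical.

```python
def fraction_redhcp(outage_s: float,
                    detect_min_s: float = 30.0,
                    detect_max_s: float = 45.0) -> float:
    """Estimate the share of devices that would re-request DHCP.

    Assumes detection thresholds are uniformly distributed between
    detect_min_s and detect_max_s (an illustrative assumption only).
    A device re-requests DHCP if the outage outlasts its threshold.
    """
    if outage_s <= detect_min_s:
        return 0.0
    if outage_s >= detect_max_s:
        return 1.0
    return (outage_s - detect_min_s) / (detect_max_s - detect_min_s)


# A 40-second outage under the uniform assumption:
print(f"{fraction_redhcp(40.0):.0%}")  # prints 67%
```

Under that assumption, an outage of 20 seconds would trigger no re-requests at all, while one of 40 seconds would trigger them on roughly two thirds of devices at nearly the same moment.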

Moving forward, we are finalizing fixes to our aggregation routers to reduce the impact of similar events. We’re also investigating the reason for the long core router failover time. Finally, we are holding an ongoing post-mortem discussion with our vendor architects and engineers on this incident and will be seeking ways to prevent a similar issue from happening again.

–END RFO–

 

Update: 11:59pm We believe all services have been restored. If you are still impacted or otherwise experiencing service issues please contact our technical support team tomorrow morning.

Update: 11:30pm We have begun to restore regions in a staged fashion as we carefully monitor progress.

Update: 10:54pm We continue working on resolving the issue.

Update: 10:24pm We continue to work on resolving the issue.

Update: 9:54pm We are still working on resolving the issue.

Update: 8:53pm We currently have our network operations team looking into the DHCP issues (we started around 7pm tonight). As of right now there is no new information. We will post updates as new information becomes available. Thank you for your patience!

As of 7:00pm we are experiencing network issues that are causing DHCP problems. As a result, some customers are unable to access the internet. Currently we are only seeing this issue on our Sonic Fiber services. We thank you for your patience as we work to resolve this.

 

 

103 comments for “[UPDATE : RFO] Known Outage in the Northern California Area.”

  1. Thanks. Corner of Acton and Ashby is out of service.
    Hopefully this can be fixed soon.
    My wife is telecommuting for work and needs the internet.
    Cheers!

  2. Thanks for all your hard work. This is the only outage I’ve had since moving to sonic. You all are doing an amazing job and I’m positive you’re working your butts off to handle this issue. Which in the grand scheme of things is not a big deal. Hope you’re all safe and healthy!

  3. Our internet is down, can’t get thru on support numbers. Live in Kensington Ca. We tried rebooting, please advise with a technical support person available.

  4. Can you please notify customers when things like this happen? I didn’t get any email or error app notification or text. Would have saved me 45 min of resetting and troubleshooting our equipment, looking for updates, etc. Thanks!

  5. We’re in Potrero Hill neighborhood of sf. Any idea when internet is likely back up?

  6. When will this issue be resolved as my kids have class tomorrow online due to the Covid19 pandemic therefore cannot be in school since all schools are cancelled for in person studies.

  7. Is DCHP a thing? Do you mean DHCP? Can you tell difference? Maybe this is why our network is down.

  8. Any idea on an eta for a fix? Shelter in place with no Netflix is not as much fun

  9. Working on getting things restored, Joseph! Once we have more information we’ll post here.

  10. We hear where you’re coming from Brian. No ETA yet, but our entire engineering team is working on it – all hands on deck. Hopefully we’ll get you up and running soon!

  11. No ETR yet, but we’re working on it. Sorry about the inconvenience, Vipul, but bear with us and we’ll get your connection back up as soon as we’re able!

  12. Chag sameach, Debbie! We’re working to get things back up and running for your family as quickly as we can!

  13. Thanks for the transparent communication and after hours work to get this fixed!

  14. That’s the goal, Dylan, and normally we’re able to communicate outages or maintenance directly to our users. Unfortunately, that’s not always the case, but we’re working on it. As soon as we have more information we’ll post an update. Thank you for your patience!

  15. Not yet, Jon. Our entire engineering team is working on this though, so we’re hoping to get you back up ASAP!

  16. We want you to have it, and as soon as that’s possible you’ll be back up and running!

  17. Thank you for your efforts to restore service. I can imagine that it would be an intense time to be addressing an outage. Be well.

  18. We don’t have an estimated time of repair as this is an unintended network outage. Our entire engineering team is working to get things back up and running, and we’re doing everything we can to restore service as soon as possible.

  19. You got it Matt! Thanks for your patience, we’ll get you up and running soon!

  20. Thanks for trying guys! I just wanna say thanks for the open communication. Hope everything is ok with your engineering team and everyone is staying safe in this trying time!

  21. Hi Jason, there’s currently a network outage in your area. We’re working to get things repaired, bear with us!

  22. I live in a house in the Mission SF with 3 separate Units each with their Sonic Fiber Account. The top two units both still have service while I do not since around 7pm. Equipment failure or server issue?

  23. Most of us aren’t done with it yet either, and that’s why we’re here after hours trying to get service restored!

  24. What would be helpful, at least from a communications perspective, would be to commit to providing updates regularly, say every 30 minutes. Sometimes those updates might be no progress, other times something like we’ve escalated to CISCO (or whomever) and . . . , and eventually a “we’ve identified root cause, and the ETR is . . .”. The steady flow of information would save many phone wifi browser refreshes 🙂

  25. The outage is ongoing, and we’re still working to get things restored. Sorry about that, Andrew!

  26. That’s a going back to X finity this is the 2nd time in a couple of months with no notifications

  27. The call volume is so high that there might be issues reaching our support team. We’re working to get service restored ASAP, sorry for the inconvenience.

  28. There is currently no ETR. All of our network engineers are working to get service restored – thank you for your patience!
