[UPDATE : RFO] Known Outage in the Northern California Area.

April 8, 2020

–START RFO–

Some Sonic fiber customers experienced an internet outage on the evening of Wednesday, April 8th. Below is the Reason For Outage (RFO) write-up, an internal process document which we are sharing to provide some insight into this unprecedented event.

On April 8th, 2020 at 6:44 PM PDT, one of the large core routers in the Sonic network failed. We have redundant equipment that is configured to take over the load immediately. Unfortunately, due to the way the router failed, the failover to our redundant router was slower than expected and took a bit over forty seconds. During this time, roughly half of inbound and a significant majority of outbound internet traffic was stalled. Most data traffic in Northern California was affected, though data traffic on IPBB and in Southern California was not impacted. Voice services network-wide were also briefly impaired.

Once traffic rerouted around the failed core router, we noticed traffic levels were much lower than expected. We were missing many tens of Gbps of traffic on transit and peering sources combined. Our Network Engineering team began investigating immediately.

After routing stabilized, we began receiving reports via Sonic’s support team that some Fusion Fiber customers were offline. After verifying examples, we noticed that the affected customer premises equipment was stuck in a DHCPDISCOVER state, unable to obtain an IP address. Rebooting the home router and ONT (Optical Network Terminal – the device that terminates the fiber optic connection at the customer premises) did not obtain an IP address or restore service.

What we found was that the brief interruption in internet access caused some customer premises devices to re-send large quantities of DHCP messages, which overwhelmed their associated customer aggregation routers across our network. As a result, those aggregation routers became overloaded and unable to hand out new IP addresses. Customers who had an IP address before the outage and did not release it during the minute-long outage window were not impacted.

Throughout the evening, the Network Engineering team worked to isolate the cause and mitigate the issue. We were concerned that any heavy-handed approach would cause a larger outage for customers who were not affected. We found a potential fix: restarting an internal process on the aggregation routers to clear out DHCP queues that were full due to the flood of requests. However, while this resolved the issue for customers in some regions, it wasn’t enough for some heavily loaded devices.

To fully resolve the issue, the Network Engineering team then deployed a configuration change to severely throttle inbound DHCPDISCOVER messages. That slowed the flood enough for even the most populated aggregation routers to slowly start handing out IP addresses to customer home routers without being overwhelmed. The last few users were brought back up shortly after midnight. Coincidentally, this is also around the time when we were able to replace the trigger for all of this – the core router mentioned earlier.
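The exact configuration change is not described here. As a rough illustration of the general idea only, throttling of this kind is often modeled as a token bucket: requests are admitted only while tokens are available, and tokens refill at a fixed rate. The sketch below is hypothetical throughout (the class, function names, and the rate and burst values are assumptions, not the settings used on the aggregation routers).

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical token-bucket policer for inbound DHCPDISCOVER messages."""
    rate: float          # tokens added per second (assumed value, not Sonic's)
    burst: float         # maximum number of tokens the bucket can hold
    tokens: float = 0.0  # tokens currently available
    last: float = 0.0    # timestamp of the previous arrival, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a message arriving at time `now` is admitted."""
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # message is dropped; the client will retry later


def admitted(arrival_times, rate, burst):
    """Return the arrival times that pass a bucket that starts full."""
    start = arrival_times[0] if arrival_times else 0.0
    bucket = TokenBucket(rate=rate, burst=burst, tokens=burst, last=start)
    return [t for t in arrival_times if bucket.allow(t)]


if __name__ == "__main__":
    # 1000 DISCOVER messages arriving within one second (a made-up flood).
    flood = [i / 1000 for i in range(1000)]
    passed = admitted(flood, rate=10.0, burst=5.0)
    print(f"{len(passed)} of {len(flood)} messages admitted")
```

With a policer like this in front of the DHCP process, the flood is cut down to a rate the router can serve, at the cost of some customers waiting longer for their next retry to be admitted.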

We have since reached out to the vendor of the customer premises equipment affected in this outage. Their devices detect internet outages within 30-45 seconds and will then re-request DHCP. This is close to the 40 seconds that the aforementioned failover took, and is likely why so many devices in our network started to re-DHCP, triggering the congestion that prevented customer routers from coming back online.
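To see why a 40-second failover sits in an awkward spot, here is a small back-of-the-envelope sketch. It assumes, purely for illustration, that each device's outage-detection threshold is spread uniformly across the 30-45 second range; the vendor's actual distribution is not stated, and the function name is hypothetical.

```python
def fraction_rediscovering(outage_s: float,
                           detect_min_s: float = 30.0,
                           detect_max_s: float = 45.0) -> float:
    """Estimate the share of devices that would re-request DHCP.

    ASSUMPTION: detection thresholds are uniformly distributed between
    detect_min_s and detect_max_s. A device re-requests DHCP only if the
    outage lasts at least as long as its own threshold.
    """
    if outage_s <= detect_min_s:
        return 0.0
    if outage_s >= detect_max_s:
        return 1.0
    return (outage_s - detect_min_s) / (detect_max_s - detect_min_s)


if __name__ == "__main__":
    for outage in (20, 30, 40, 45, 60):
        share = fraction_rediscovering(outage)
        print(f"{outage:>2}s outage: about {share:.0%} of devices re-DHCP")
```

Under that assumption, an outage shorter than 30 seconds would trigger no devices, while one of 40 seconds would trigger a large share of them nearly at once, which is consistent with the flood described above.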

Moving forward, we are finalizing fixes to our aggregation routers to reduce the impact of similar events. We’re also investigating the reason for the long core router failover time. Finally, we are holding an ongoing post-mortem discussion with our vendor architects and engineers on this incident and will be seeking ways to prevent a similar issue from happening again.

–END RFO–

Update: 11:59pm We believe all services have been restored. If you are still impacted or otherwise experiencing service issues please contact our technical support team tomorrow morning.

Update: 11:30pm We have begun to restore regions in a staged fashion as we carefully monitor progress.

Update: 10:54pm We continue working on resolving the issue.

Update: 10:24pm We continue to work on resolving the issue.

Update: 9:54pm We are still working on resolving the issue.

Update: 8:53pm Our network operations team has been looking into the DHCP issues since around 7pm tonight. As of right now there is no new information. We will post updates as soon as we have more. Thank you for your patience!

As of 7:00pm we are experiencing network issues that are causing DHCP problems. As a result, customers are unable to access the internet. Currently we are only seeing this issue on our Sonic Fiber services. We thank you for your patience as we work to resolve this.


103 comments for “[UPDATE : RFO] Known Outage in the Northern California Area.”

Jason M Surles says:

April 8, 2020 at 7:29 pm

I’m down
Andrew Leith says:

April 8, 2020 at 7:36 pm

Can’t connect still at 7:35 in Potrero hill sf
Stuart Rosenthal says:

April 8, 2020 at 7:38 pm

No fiber service
Kevin Crawford says:

April 8, 2020 at 7:39 pm

Thanks. Corner of Acton and Ashby is out of service.
Hopefully this can be fixed soon.
My wife is telecommuting for work and needs the internet.
Cheers!
couch potatoe says:

April 8, 2020 at 7:39 pm

How am I supposed to watch Tiger King?!?
Marci Lockwood says:

April 8, 2020 at 7:41 pm

Ours is still out
Jonathan says:

April 8, 2020 at 7:45 pm

the phone is also down
Taj says:

April 8, 2020 at 7:46 pm

What is the ETA? When will this be fixed?
Todd Lee says:

April 8, 2020 at 7:48 pm

Can someone email or text when it is back up and running
Natalie Dunnege says:

April 8, 2020 at 7:57 pm

My internet is also out in San Francisco at 21st and Judah
Danielle Galan says:

April 8, 2020 at 7:59 pm

When will this be resolved?
Christopher Washington says:

April 8, 2020 at 8:00 pm

Thanks for all your hard work. This is the only outage I’ve had since moving to sonic. You all are doing an amazing job and I’m positive you’re working your butts off to handle this issue. Which in the grand scheme of things is not a big deal. Hope you’re all safe and healthy!
Tyler Merrell says:

April 8, 2020 at 8:02 pm

Mine is down also. North Oakland/Emeryville.
Patrick says:

April 8, 2020 at 8:07 pm

Our internet is down, can’t get thru on support numbers. Live in Kensington Ca. We tried rebooting, please advise with a technical support person available.
Dylan Bergeson says:

April 8, 2020 at 8:07 pm

Can you please notify customers when things like this happen? I didn’t get any email or error app notification or text. Would have saved me 45 min of resetting and troubleshooting our equipment, looking for updates, etc. Thanks!
Jon Roach says:

April 8, 2020 at 8:17 pm

We’re in Potrero Hill neighborhood of sf. Any idea when internet is likely back up?
Sncnf says:

April 8, 2020 at 8:21 pm

Give me my internet
Elaine Minton says:

April 8, 2020 at 8:25 pm

When will this issue be resolved as my kids have class tomorrow online due to the Covid19 pandemic therefore cannot be in school since all schools are cancelled for in person studies.
Amy Abascal says:

April 8, 2020 at 8:26 pm

Is DCHP a thing? Do you mean DHCP? Can you tell difference? Maybe this is why our network is down.
Paul Catasus says:

April 8, 2020 at 8:30 pm

Out for 2+ hours!
Vipul Kumar says:

April 8, 2020 at 8:39 pm

Any expected time for a fix?
Brian says:

April 8, 2020 at 8:45 pm

Any idea on an eta for a fix? Shelter in place with no Netflix is not as much fun
Joseph Ng says:

April 8, 2020 at 8:49 pm

Please keep me updated
Daniel Thompson says:

April 8, 2020 at 8:53 pm

Working on getting things restored, Joseph! Once we have more information we’ll post here.
Debbie Lefkowitz says:

April 8, 2020 at 8:53 pm

Unbelievably annoying. No WiFi for 20 person Zoom Seder.
Daniel Thompson says:

April 8, 2020 at 8:53 pm

We hear where you’re coming from Brian. No ETA yet, but our entire engineering team is working on it – all hands on deck. Hopefully we’ll get you up and running soon!
Daniel Thompson says:

April 8, 2020 at 8:54 pm

No ETR yet, but we’re working on it. Sorry about the inconvenience, Vipul, but bear with us and we’ll get your connection back up as soon as we’re able!
C says:

April 8, 2020 at 8:55 pm

Internet Not working in San Francisco mission district. Is this the cause?
Daniel Thompson says:

April 8, 2020 at 8:56 pm

Chag sameach, Debbie! We’re working to get things back up and running for your family as quickly as we can!
John says:

April 8, 2020 at 8:56 pm

Good luck! I love you guys.
Daniel Thompson says:

April 8, 2020 at 8:57 pm

Sorry about that Paul. We’re working to get things fixed ASAP!
Matt Darby says:

April 8, 2020 at 8:58 pm

Thanks for the transparent communication and after hours work to get this fixed!
Daniel Thompson says:

April 8, 2020 at 9:00 pm

That’s the goal, Dylan, and normally we’re able to communicate outages or maintenance directly to our users. Unfortunately, that’s not always the case, but we’re working on it. As soon as we have more information we’ll post an update. Thank you for your patience!
Daniel Thompson says:

April 8, 2020 at 9:00 pm

Not yet, Jon. Our entire engineering team is working on this though, so we’re hoping to get you back up ASAP!
Daniel Thompson says:

April 8, 2020 at 9:01 pm

We want you to have it, and as soon as that’s possible you’ll be back up and running!
Matthew Brensilver says:

April 8, 2020 at 9:02 pm

Thank you for your efforts to restore service. I can imagine that it would be an intense time to be addressing an outage. Be well.
Daniel Thompson says:

April 8, 2020 at 9:02 pm

We don’t have an estimated time of repair as this is an unintended network outage. Our entire engineering team is working to get things back up and running, and we’re doing everything we can to restore service as soon as possible.
Daniel T says:

April 8, 2020 at 9:03 pm

Thanks for catching the typo, Amy!
Daniel T says:

April 8, 2020 at 9:04 pm

You got it Matt! Thanks for your patience, we’ll get you up and running soon!
Greg Perkins says:

April 8, 2020 at 9:04 pm

Wow. 2 hrs with no internet and now my phone line is out as well. SMH.
Benjamin Sheiner says:

April 8, 2020 at 9:04 pm

Thanks for trying guys! I just wanna say thanks for the open communication. Hope everything is ok with your engineering team and everyone is staying safe in this trying time!
Daniel T says:

April 8, 2020 at 9:06 pm

Hi Jason, there’s currently a network outage in your area. We’re working to get things repaired, bear with us!
Theodore Conz says:

April 8, 2020 at 9:06 pm

I live in a house in the Mission SF with 3 separate Units each with their Sonic Fiber Account. The top two units both still have service while I do not since around 7pm. Equipment failure or server issue?
Daniel T says:

April 8, 2020 at 9:07 pm

Most of us aren’t done with it yet either, and that’s why we’re here after hours trying to get service restored!
Chris S says:

April 8, 2020 at 9:14 pm

What would be helpful, at least from a communications perspective, would be to commit to providing updates regularly, say every 30 minutes. Sometimes those updates might be no progress, other times something like we’ve escalated to CISCO (or whomever) and . . . , and eventually a “we’ve identified root cause, and the ETR is . . .”. The steady flow of information would save many phone wifi browser refreshes 🙂
Steve says:

April 8, 2020 at 9:14 pm

QUARANTINE NIGHTMARE
Daniel T says:

April 8, 2020 at 9:19 pm

The outage is ongoing, and we’re still working to get things restored. Sorry about that, Andrew!
John says:

April 8, 2020 at 9:19 pm

That’s a going back to X finity this is the 2nd time in a couple of months with no notifications
Daniel T says:

April 8, 2020 at 9:19 pm

The call volume is so high that there might be issues reaching our support team. We’re working to get service restored ASAP, sorry for the inconvenience.
Daniel T says:

April 8, 2020 at 9:20 pm

There is currently no ETR. All of our network engineers are working to get service restored – thank you for your patience!
