Month: April 2020

SOMA SF – Fusion/Flexlink Emergency Maintenance

Update: 12:25AM All work has been successfully completed for the night.

Tonight (4/13/20) at 11:59pm, we are performing emergency maintenance on network hardware serving Fusion and Flexlink copper customers in the SOMA area of San Francisco.  Expected downtime is 15 minutes.  An update will be provided when our work is complete.

Support Callbacks Requested 4/8 Purged

During the outage last night, a tremendous number of impacted customers scheduled callbacks through our phone system. After resolving the outage, we found that the overwhelming majority of these outstanding requests had already been cleared by the fix but, due to their volume, were diminishing the ability of customers to reach support.

We have purged all pending callbacks requested on 4/8 from our system. If you are still experiencing issues, please give our support team a call at your earliest convenience so we can get your service restored.

[UPDATE: RFO] Known Outage in the Northern California Area

–START RFO–

Some Sonic fiber customers experienced an internet outage on the evening of Wednesday, April 8th. Below is the Reason For Outage (RFO) write-up, an internal process document which we are sharing to provide some insight into this unprecedented event.

On April 8th, 2020 at 6:44 PM PDT, one of the large core routers in the Sonic network failed. We have redundant equipment that is configured to take over the load immediately. Unfortunately, due to the way the router failed, the failover to our redundant router was slower than expected, taking a bit over forty seconds. During this time, roughly half of inbound and a significant majority of outbound internet traffic was stalled. Most data traffic in Northern California was affected, though data traffic on IPBB and in Southern California was unimpacted. Voice services network-wide were also briefly impaired.
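
For readers who want a concrete sense of how a failover window like this looks from outside the network, here is a minimal Python sketch that probes a reachable host once per second and reports the length of any gap in connectivity. The probe target, port, and interval are placeholders, and this is not the tooling our engineers used; it simply illustrates how a roughly forty-second stall would show up from the edge.

    # Hypothetical monitoring sketch: estimate an outage window by timing the
    # gap between successful TCP probes. Probe target and interval are
    # placeholders, not Sonic infrastructure.
    import socket
    import time

    PROBE_HOST = "198.51.100.1"   # placeholder target (TEST-NET-2 range)
    PROBE_PORT = 53               # any TCP service expected to answer
    INTERVAL = 1.0                # seconds between probes

    def probe_once(timeout=1.0):
        """Return True if a TCP connection to the probe target succeeds."""
        try:
            with socket.create_connection((PROBE_HOST, PROBE_PORT), timeout=timeout):
                return True
        except OSError:
            return False

    def watch_for_outage():
        outage_started = None
        while True:
            up = probe_once()
            now = time.time()
            if not up and outage_started is None:
                outage_started = now          # connectivity just dropped
            elif up and outage_started is not None:
                print(f"Estimated outage window: {now - outage_started:.0f}s")
                outage_started = None         # connectivity is back
            time.sleep(INTERVAL)

    if __name__ == "__main__":
        watch_for_outage()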

Once traffic rerouted around the failed core router, we noticed traffic levels were much lower than expected. We were missing many tens of Gbps of traffic on transit and peering sources combined. Our Network Engineering team began investigating immediately.

After routing stabilized, we began receiving reports via Sonic’s support team of some Fusion Fiber customers in an offline state. After verifying examples, we noticed that affected customer premises equipment was stuck in a DHCPDISCOVER state, unable to obtain an IP address. Rebooting the home router and ONT (Optical Network Terminal – the device that terminates the fiber optic connection at the customer premises) did not allow the equipment to obtain an IP address and restore service.
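
For context, DHCPDISCOVER is the first step of the normal DISCOVER → OFFER → REQUEST → ACK exchange a router goes through to obtain an address. The rough scapy sketch below shows what that first message looks like on the wire; the interface name and MAC address are placeholders, and this is an illustration of the protocol state the devices were stuck in, not the firmware running on the affected equipment.

    # Illustrative only: build and send a single DHCPDISCOVER with scapy.
    # Interface and MAC are placeholders; running this requires root.
    import random
    from scapy.all import Ether, IP, UDP, BOOTP, DHCP, mac2str, sendp

    IFACE = "eth0"                      # placeholder interface name
    CLIENT_MAC = "02:00:00:00:00:01"    # placeholder, locally administered MAC

    discover = (
        Ether(src=CLIENT_MAC, dst="ff:ff:ff:ff:ff:ff")
        / IP(src="0.0.0.0", dst="255.255.255.255")
        / UDP(sport=68, dport=67)
        / BOOTP(chaddr=mac2str(CLIENT_MAC), xid=random.randint(0, 0xFFFFFFFF))
        / DHCP(options=[("message-type", "discover"), "end"])
    )

    # A healthy exchange answers this with an OFFER, then REQUEST and ACK
    # follow. The affected devices never received an OFFER, so they stayed in
    # this first state and kept re-sending DISCOVERs.
    sendp(discover, iface=IFACE, verbose=False)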

What we found was that the brief interruption in internet access caused some customer premises devices to re-send large quantities of DHCP messages, which overwhelmed their associated customer aggregation routers across our network. This resulted in those aggregation routers becoming overloaded and unable to hand out new IP addresses. Customers whose equipment had an IP address before the outage and did not release it during the minute-long outage window were unimpacted.
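
The toy simulation below illustrates the effect with made-up numbers (the queue depth, service rate, and arrival rates are all assumptions, not measurements from our routers): a steady trickle of requests is answered comfortably, while a synchronized burst fills the queue and most new DISCOVERs are dropped.

    # Toy queue simulation with invented numbers; not the routers' actual
    # internals. Shows why a synchronized retry burst causes drops even when
    # steady-state load is easily handled.
    from collections import deque

    QUEUE_LIMIT = 1000      # assumed queue depth
    SERVICE_RATE = 200      # assumed requests answered per second

    def simulate(arrivals_per_second, seconds=30):
        queue, answered, dropped = deque(), 0, 0
        for sec in range(seconds):
            rate = arrivals_per_second[min(sec, len(arrivals_per_second) - 1)]
            for _ in range(rate):
                if len(queue) < QUEUE_LIMIT:
                    queue.append(sec)
                else:
                    dropped += 1            # queue full: new DISCOVER is lost
            for _ in range(min(SERVICE_RATE, len(queue))):
                queue.popleft()
                answered += 1
        return {"answered": answered, "dropped": dropped, "still queued": len(queue)}

    print(simulate([150]))                  # steady load: nothing dropped
    print(simulate([150, 5000, 5000, 150])) # synchronized burst: heavy drops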

Throughout the evening, the Network Engineering team worked to isolate the cause and mitigate the issue. We were concerned that any heavy-handed approach would cause a larger outage for the customers who were not affected. We found a potential fix: restarting an internal process on the aggregation routers to clear out DHCP queues that were full due to the flood of requests. While this resolved the issue for customers in some regions, it wasn’t enough for the most heavily loaded devices.

To fully resolve the issue, the Network Engineering team then deployed a configuration change to severely rate-limit inbound DHCPDISCOVER messages. That slowed the flood enough for even the most heavily populated aggregation routers to slowly start handing out IP addresses to customer home routers without being overwhelmed. The last few users were brought back up shortly after midnight. Coincidentally, this is also around the time we were able to replace the trigger for all of this – the core router mentioned earlier.
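
The actual fix was router configuration rather than software we wrote, but the idea behind that kind of rate limiting can be sketched as a token bucket sitting in front of the DHCP process. The rate and burst values below are assumptions chosen only to make the example readable.

    # Sketch of token-bucket rate limiting for inbound DHCPDISCOVERs. The
    # numbers are invented; the real change was device configuration.
    import time

    class TokenBucket:
        def __init__(self, rate_per_sec, burst):
            self.rate = rate_per_sec        # tokens refilled per second
            self.capacity = burst           # maximum burst size
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self):
            """Return True if one DISCOVER may be passed through."""
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False                    # excess DISCOVERs are dropped

    limiter = TokenBucket(rate_per_sec=50, burst=100)   # assumed values

    # A burst of 1,000 simultaneous DISCOVERs: only the bucket's burst worth
    # get through right away, and the rest are shed for clients to retry.
    passed = sum(1 for _ in range(1000) if limiter.allow())
    print(f"passed {passed}, shed {1000 - passed}")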

We have since reached out to the vendor we use for the customer premises equipment affected in this outage. Their devices detect internet outages within 30-45 seconds and will then re-request DHCP. This is close to the forty seconds the aforementioned failover took, and is likely why so many devices in our network started to re-DHCP at once, triggering the congestion that prevented customer routers from coming back online.
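
A common way to soften this kind of synchronized retry storm is randomized exponential backoff; RFC 2131 recommends randomizing DHCP retransmission delays for exactly this reason. The sketch below is purely illustrative and is not the vendor’s implementation; the base delay, cap, and jitter range are assumptions.

    # Illustration of randomized exponential backoff, so clients that lost
    # connectivity at the same moment do not all retry in lockstep. Numbers
    # are assumptions, not the vendor's firmware values.
    import random

    def retry_delays(attempts=6, base=4.0, cap=64.0):
        """Yield per-attempt delays with +/-50% jitter."""
        for attempt in range(attempts):
            delay = min(cap, base * (2 ** attempt))
            yield delay * random.uniform(0.5, 1.5)

    # Two clients that went offline together now spread their retries out
    # instead of hammering the aggregation router at the same instant.
    for client in ("cpe-a", "cpe-b"):
        print(client, [round(d, 1) for d in retry_delays()])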

Moving forward, we are finalizing fixes to our aggregation routers to reduce the impact of similar events. We’re also investigating the reason for the long core router failover time to reduce the impact of any similar issues. Finally, we are holding an ongoing post-mortem with our vendor’s architects and engineers on this incident and will be seeking ways to prevent a similar issue from happening again.

–END RFO–


Update: 11:59pm We believe all services have been restored. If you are still impacted or otherwise experiencing service issues, please contact our technical support team tomorrow morning.

Update: 11:30pm We have begun to restore regions in a staged fashion as we carefully monitor progress.

Update: 10:54pm We continue working on resolving the issue.

Update: 10:24pm We continue to work on resolving the issue.

Update: 9:54pm We are still working on resolving the issue.

Update: 8:53pm We currently have our network operations team looking into the DHCP issues (we started around 7pm tonight). As of right now, there is no new information. We will post updates as soon as we have new information. Thank you for your patience!

As of 7:00pm we are experiencing a network issue that is causing DHCP problems, preventing customers from accessing the internet. Currently we are only seeing this issue on our Sonic Fiber services. We thank you for your patience as we work to resolve this.


Brief Outage

A device failure caused a brief routing hit, which led to a momentary loss of traffic. Traffic has failed over to a backup device as designed. We are investigating.

-Network Engineering

Phone switch outage

Due to a configuration error on our phone switch, we experienced a disruption in phone calls lasting around 5 minutes. The issue has been resolved.

-Network Engineering

Core Router Maintenance

Update 12:35AM: Maintenance has been successfully completed.

Thursday, April 9, starting at 11:59pm, we will be performing maintenance on core equipment. No customer impact is expected. This maintenance window is 3 hours.

-Network Engineering

Core Router Maintenance

Update 12:57AM: Maintenance has been successfully completed.

Tuesday, April 7, starting at 11:59pm, we will be performing maintenance on core equipment. No customer impact is expected. This maintenance window is 3 hours.

-Network Engineering

Legacy DSL outage in the Bay Area

UPDATE 04/03/2020 11:04 am – The outage is now cleared and service restored.

A small subset of Legacy DSL customers are currently affected by a service outage due to a hardware failure. Our engineers are working toward resolution, though we have no estimated time of repair. We will update this post with further developments. Apologies to those of you impacted by this outage.

IMAP/POP3 Server Migration

Starting at 11pm this evening, SOC will be moving part of our IMAP and POP3 mail cluster to new network equipment. We expect the work to take 2 hours, and no downtime is anticipated.

-SOC