San Rafael Central Office Reason For Outage

On Sunday, at approximately 8:30 PM PST, our on-call Network Engineering staff began to receive network event alerts indicating that some of our core routers were losing power in a Central Office where we collocate. This site acts as both a serving point for our Sonic Fiber customers and a core routing node at the junction between our ‘North Bay Ring’ and our ‘South Bay Ring’. After an initial investigation, the Network Engineering team started an incident call to assess possible failures and mitigations, initiated our Outage Response processes with our Customer Care department, and dispatched our on-call Central Office Technician. While the Care team sent notices directly to affected Sonic Fiber customers, more network alerts began to flow in as one device after another lost power at this site.

Initial speculation among the team centered on the historic flooding occurring in Marin County. Combined with information from the Marin Emergency Services website indicating a fire truck responding to the exact street address of our equipment, this began to paint a picture of likely flood damage to the power systems that serve both the ILEC (Incumbent Local Exchange Carrier) and customers like ourselves within the building. This speculation proved to be true when we arrived on site to find additional ILEC repair technicians arriving. We were given a rough estimate of 3 hours to finish pumping the water out of the basement, to be followed by an unknown length of time to assess the damage to the electrical equipment and begin repairs. Because of safety concerns from the repair team, we were prevented from accessing our equipment to check fuses. Our technician reported that the building smelled strongly of ‘burnt electronics’. By around midnight, power was restored and our devices automatically turned on and began to work again. By around 2 AM, services were mostly restored.

Unfortunately, our OLTs (Optical Line Terminals) at this site powered back on in an internally degraded state in which they did not allow enough bandwidth between the OLT and the rest of the network, causing bandwidth bottlenecks for our Fiber customers. The only fix for this was a scheduled night-time maintenance to restore full-speed connectivity, which was performed at 11:59 PM last night to avoid impact during peak usage hours.

In Sonic’s history of collocating in ILEC Central Offices, a failure like this had never happened before. This was a first for us. Central Offices are engineered and built with resiliency to withstand many failure modes. The power that our routers and network equipment draw is fed from two redundant battery banks, which provide DC voltage to our equipment. Sites have generators that are regularly tested in case of utility power failure. Facilities are strictly maintained to high standards and federally regulated to ensure failures like this are as unlikely as possible. In addition to this, we have built our network to have redundant paths, redundant locations, and recovery plans to handle truly disastrous failures. Despite all of this, 911 service across Marin County was severely degraded for nearly all carriers, not just Sonic. There were reports of various LTE services with other carriers being impacted as well, and of course, Sonic customers lost internet and phone access during peak usage times.

As mentioned earlier, this particular Central Office is both a serving point for Sonic Fiber customers and a core site that carries traffic from points north of San Rafael to points south, ultimately to data centers in Silicon Valley where we peer with the other networks that make Sonic part of the Internet. Approximately 2,700 Fiber customers lost service when the OLTs and the other devices within the building lost power. Events like this are impactful, especially on a Sunday night when many people are trying to squeeze the last bit of personal and family time out of the weekend by enjoying our service. Because this site is a critical point where our North Bay Ring meets our South Bay Ring, the outage could have created widespread service reliability issues for all of our North Bay customers (points north of San Rafael). It did not, because we engineered our network to have multiple paths, multiple sites, and, critically, enough bandwidth to handle failure scenarios as internet traffic automatically re-routes around the ring to avoid the failed node.
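For readers curious about what "re-routing around the ring" means in practice, here is a minimal sketch in Python. It is an illustration only, not a description of how our routers are actually configured: it assumes a single simplified ring of five hypothetical sites named A through E, and it uses a plain breadth-first search in place of a real routing protocol. It also ignores link capacity, which in a real network must be large enough to carry the diverted traffic.

```python
from collections import deque

def build_ring(nodes):
    """Return an adjacency map where each node links to its two ring neighbours."""
    n = len(nodes)
    return {
        node: {nodes[(i - 1) % n], nodes[(i + 1) % n]}
        for i, node in enumerate(nodes)
    }

def shortest_path(adjacency, src, dst, failed=frozenset()):
    """Breadth-first search that skips failed nodes; returns None if unreachable."""
    if src in failed or dst in failed:
        return None
    previous = {src: None}  # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the predecessor chain back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(adjacency[node]):
            if neighbour not in failed and neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None

# Hypothetical site names; "C" stands in for the site that lost power.
ring = build_ring(["A", "B", "C", "D", "E"])
normal = shortest_path(ring, "B", "D")
rerouted = shortest_path(ring, "B", "D", failed={"C"})
print(normal)    # path through C while it is healthy
print(rerouted)  # path the long way around the ring once C has failed
```

In this toy ring, traffic between B and D normally passes through C. Once C is marked as failed, the only remaining path runs the other way around the ring, which is why spare bandwidth on every segment matters.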

While there was nothing that Sonic could have done to prevent the flooding that took out power, we strive to be open and transparent, to communicate with our customers and the broader public, and to continue building reliable networks so we can provide the best service in the business. It’s our goal that you don’t even need to think about your connection. When we are doing our job right, you should never have to think of us. It should just work.
