UPS Failure Redux

First, we’d like to clarify the extent of the problems caused by the UPS failure and the subsequent dropping of load in the datacenter.  This had no impact on any residential or enterprise connectivity services, including Legacy DSL, Fusion and Fusion FTTN.  The UPS that failed was the smallest of the three UPSes in Santa Rosa, and we had been working to migrate load from it.  As such, fewer than 20 customers in total lost some or all of their power circuits, some of which may have been part of redundant A/B circuits.  Some colo customers lost connectivity because several distribution switches did lose power.  Most Sonic services, including POP, IMAP and webmail, were not affected or saw only a brief outage as single-PSU equipment rebooted and/or clusters converged as load shifted to systems that were unaffected.  The only public service that had lingering issues was our webhosting cluster, which required a little manual attention to come back online.

The outage was ultimately caused by a physical failure of the maintenance bypass switch in the bypass cabinet for the PDU we were moving: one of the phases in the switch stuck and/or didn’t close correctly.  In hindsight, it is unfortunate that we chose to operate the switch in the first place, as it wasn’t strictly the simplest way to migrate the load.  The last power failure in the datacenter was in Oct ’04, when the same UPS failed.

In the coming weeks, we will schedule the migration off of the temporary feeds that were put in place.  This final move is significantly easier to execute and has an exceedingly low likelihood of causing any service interruptions.

-Kelsey, Russ, and the rest of System and Network Operations.

 
