Fri Jul 2 14:28:52 PDT 2004 - Brief power event in Santa Rosa. During a UPS system test, our Liebert power system failed, causing a power outage of about one minute in our Santa Rosa datacenter. The planned test involved cutting utility input power to one UPS system, which should have switched to its batteries. Instead, it tripped the master battery cabinet output circuit breaker, forcing the system into bypass.
This caused a brief network outage while routers and servers in Santa Rosa rebooted, and impacted availability of many services as they came back online. Email POP and SMTP are currently unavailable and are expected to be back online in a few minutes. Some DSL customers (those homed out of Santa Rosa) lost connection briefly and are now coming back online. Customers in San Francisco may have had problems with domain name resolution. Hosted websites were offline briefly while the web cluster and its storage arrays rebooted.
We are sorry for the interruption, and are working with Liebert to determine why there was a failure. We have an additional Liebert power system in the planning phases now, and in the future we’ll ensure that we use it to provide redundant, diverse-input power to the systems and equipment that support it, including key routers, servers and storage. -Dane
Update: After complete testing, the technician found that one battery in the Liebert UPS cabinet had failed completely, and one other was not holding its voltage at the full level. The system was reconfigured without these two batteries, taken out of bypass, and put back into online mode. New batteries are on their way, and we’ve implemented periodic testing procedures to check voltage levels for signs of failure to ensure that this doesn’t happen again.