Tue Oct 19 08:17:40 PDT 2004 – (Updated) Power outage in Santa Rosa. During the storm, we suffered a brief power outage, which triggered a failure in our Liebert UPS battery plant. This resulted in downtime for most of our services, beginning at 5:57AM, and lasting almost an hour. Email took a bit longer to bring online, and is available now – no email was lost, and any pending deliveries were queued.
Currently we’re running with the UPS bypassed and with our diesel generator running on standby in case of another utility outage. We will be working with Liebert service to determine why the UPS failed – our experience with the Liebert has been very, very disappointing, and while I normally hesitate to name a vendor here in the MOTD, I’m making an exception in this case.
Update: We’re back on UPS, after finding and eliminating a bad battery cell. We also found three other batteries which were dropping below the ideal voltage when a load was applied. All of these batteries were recently tested, so we’re investigating why we’ve got such a large number of battery failures.
Following this outage, we’re working up a list of lessons learned and to-do items to avoid the impact of this type of failure in the future. Here are a few of the items we’re beginning work on now:
- Deploy additional battery cabinet in parallel – cost, around $17,000.
- Coordinate with Liebert and test to ensure that system is RELIABLE!
- Schedule and execute full live system tests during 3AM maint window.
- Increase descriptiveness of text paging, and the number of people paged.
- Add UPS contact monitoring via Honeywell system and manned call center.
- Deploy RADIUS and DNS servers in San Francisco datacenter for redundancy.
- Test network routing absent Santa Rosa datacenter to ensure reachability.
- Deploy an additional UPS here for systems and routers with dual inputs.
- Add UPS to voicemail system, and increase runtime on phone system UPS.
- Drill process for customer notification and internal response.
We are committed to learning from this failure, and to making changes to ensure that this type of outage doesn’t occur again. I’m very sorry about the downtime. -Dane