Server Upgrades.
Fri Apr 20 11:42:51 PDT 2007 — Server Upgrades. We have been quietly upgrading many of our servers and clusters over the past few months to improve the overall quality of our ISP services. The upgrades include the complete replacement of our SpamAssassin cluster with six new quad-core Xeon servers, a new web cluster member, four new internal DNS servers, the recent deployment of a pair of new FTP servers, and the addition of two new inbound MX servers this morning. These upgrades, along with others not mentioned, allow us to continue to provide the high-quality, always-on services that our customers have come to expect. It is gratifying to see all of our hard work and preparation pay off. For instance, our careful selection of power sources for individual machines and networking hardware kept all of our core services available during the recent UPS event, even though roughly half of our systems lost power. -Kelsey, Nathan, Augie and Dan
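As an aside for the technically curious: one quick way to sanity-check newly deployed inbound MX servers is to confirm that each host answers on port 25 with an SMTP banner. The sketch below only illustrates the idea; the hostnames are placeholders, not our actual MX records.

    # Minimal sketch: confirm each inbound MX host accepts a TCP
    # connection on port 25 and returns an SMTP greeting line.
    # Hostnames are hypothetical placeholders, not real MX records.
    import socket

    MX_HOSTS = ["mx1.example.net", "mx2.example.net"]

    def smtp_banner(host, port=25, timeout=10):
        """Connect to an SMTP listener and return its greeting."""
        with socket.create_connection((host, port), timeout=timeout) as s:
            return s.recv(512).decode("ascii", "replace").strip()

    for host in MX_HOSTS:
        try:
            print(host, "->", smtp_banner(host))
        except OSError as exc:
            print(host, "-> FAILED:", exc)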
Power issue update.
Thu Apr 19 16:54:52 PDT 2007 — Power issue update. The UPS technicians expect to finish the repair within the next couple of hours. Provided that it passes its tests and inspection, we will transition live load back to the UPS at 10:00 PM tonight. We do not anticipate any service interruption; however, we will be fully staffed to handle the unexpected. This UPS serves a small number of customers in our colo facility and redundant load from our own server clusters. -Nathan, Kelsey, Russ
Update Thu Apr 19 22:55:20 PDT 2007 — The UPS is back in service and supporting critical load. Many thanks to the UPS technicians from JT Packard. -Nathan, Kelsey, Russ, Dane, and Matt
Power issue at our Santa Rosa datacenter…
Wed Apr 18 11:37:52 PDT 2007 — Power issue at our Santa Rosa datacenter: update. We confirmed that the UPS experienced a catastrophic internal failure. At this time we are continuing to run the UPS’ load on external bypass to PG&E service power, with our generator running in the event that PG&E service is interrupted.
www.sonic.net/ups-failure/IMG_3332.jpg
www.sonic.net/ups-failure/IMG_3321.JPG
www.sonic.net/ups-failure/IMG_3326.JPG
www.sonic.net/ups-failure/IMG_3318.JPG
-Nathan, Kelsey, Dane and Russ.
Power issue at our Santa Rosa datacenter.
Wed Apr 18 07:46:24 PDT 2007 — Power issue at our Santa Rosa datacenter. This morning at approximately 6:49am one of the UPSes that feeds our Santa Rosa colocation facility dropped its critical load, causing the 15 customers connected to the UPS to lose power. When we arrived on-site to investigate, the power room where the UPS is located smelled of burnt plastic and the input circuit breaker to the UPS was tripped. We placed the load into external bypass around 7:10am, at which time power was restored.
At this time, we surmise that a massive internal failure inside the UPS caused the fault. A UPS technician is currently en route to perform further diagnostics, make any required repairs, and start the unit back up. Until the unit is repaired, we will be operating with the UPS’ critical load supported by our Automatic Transfer Switch. Our generator is running in the event that our building PG&E feed fails.
On the plus side, Sonic.net services such as mail, web, and FTP, as well as our core networking equipment, were unaffected by the outage. All of this critical infrastructure is fed by both of our datacenter UPSes to handle circumstances such as this.
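For the curious, the value of dual feeds is easy to see with a little back-of-the-envelope arithmetic. The figures below are made up for illustration and are not measured numbers for our plant:

    # Illustrative arithmetic only: assumes the two UPS feeds fail
    # independently, with made-up per-feed availability numbers.
    feed_availability = 0.999           # each feed up 99.9% of the time
    p_fail = 1 - feed_availability      # chance a single feed is down

    single_fed_downtime = p_fail            # one feed: down whenever it fails
    dual_fed_downtime = p_fail * p_fail     # dual-fed: both must fail at once

    hours_per_year = 24 * 365
    print("single-fed downtime/yr: %.2f hours" % (single_fed_downtime * hours_per_year))
    print("dual-fed downtime/yr:   %.4f hours" % (dual_fed_downtime * hours_per_year))

With these (hypothetical) numbers, a single-fed device would expect almost nine hours of power loss per year, while a dual-fed device would expect well under a minute.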
-Nathan, Kelsey, Dane, Jen, Clay and everyone in Support
Power maintenance.
Thu Apr 12 17:29:24 PDT 2007 — Power maintenance. At 8:00am on Saturday, April 14th, one of the San Francisco colocation facilities we use for customer termination will experience a planned power outage. A number of circuit breakers inside power distribution units located on the colocation floor have failed infrared testing and are being proactively replaced before they can cause unplanned downtime. While we have been given a 4-hour outage window, we expect the total power loss to be under 2 hours.
All of our critical equipment is dual-fed from redundant power sources, so we do not expect any customer impact. However, there is the possibility of a cascading power failure, as well as failures of our transit and transport providers at that facility. We will have staff on-site to facilitate restoration in the event of a power failure, and our Network Operations Center will be fully manned to deal with any networking problems that arise.
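For those wondering what a fully manned NOC does during a window like this, much of it boils down to repeated reachability sweeps across the affected gear. A rough sketch of that kind of check, with hypothetical hosts and ports:

    # Rough sketch of a reachability sweep during a maintenance
    # window; the host list and ports are hypothetical examples.
    import socket
    import time

    CHECKS = [("router1.example.net", 22), ("www.example.net", 80)]

    def reachable(host, port, timeout=5):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    while True:
        for host, port in CHECKS:
            status = "up" if reachable(host, port) else "DOWN"
            print("%s %s:%d %s" % (time.strftime("%H:%M:%S"), host, port, status))
        time.sleep(60)  # re-check every minute during the window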
We’ll keep the MOTD up to date in the event of any problems, but expect things to go smoothly.
-Nathan, the Network Operations Center and Tech Support
Update Tue Apr 17 09:44:20 PDT 2007 — The power outage was a non-event. Despite half of the colo facility going dark, Sonic’s gear remained up and fully operational while the PDU breakers were replaced. It’s a joy to see our carefully planned redundancy work as expected, and quite a thrill to watch large circuit breakers being tripped on live loads! -Nathan and Matt