UPS Failure in Santa Rosa Datacenter

One of the three UPSes that handle load in our Santa Rosa datacenter failed early this morning and tripped into bypass.  Unfortunately, the internal failure is significant and involves at least the primary IGBTs.  We are exploring our repair options, but the most likely outcome is that we will accelerate the planned decommissioning of this UPS and the migration of its associated PDU to one of our other two UPSes.  We had planned to complete this at some point in the next six to twelve months but have not yet scheduled or scripted it.  It is a relatively straightforward procedure but must be executed with great care to ensure both the safety of our workers and that live load in the datacenter is not dropped.  Updates will be posted as needed.

Current status: Our standby generator is running so that the ATS can transfer load without interruption in the event that our primary PG&E power feed drops.

Update: Friday 14:00, we have electricians on site placing the cable to move the PDU from the failed UPS to one of our other UPSes.  We plan to complete the migration as soon as the cable is staged and ready to go.  Once the cable is placed, the new target UPS will be put into maintenance bypass.  This allows us to transition the PDU from the old bypassed UPS to the new UPS without dropping its load.  Once the cable is terminated and the breaker on the target UPS is closed, the old breaker can be opened, completing the transition.  At this point, the target UPS will be restarted.

Update: Friday 15:05, we’re beginning the bypass procedure now.

Update: Friday 15:15, unfortunately, load on the PDU was dropped momentarily, but we are continuing to complete the migration.  Power was lost to several of our single-PSU systems, but most affected services have already been restored.  More information forthcoming.

-Kelsey and Russ

Non-Intrusive Network Maintenance – 5/20/15

This maintenance is now complete.

Tomorrow night (5/20/2015) beginning at 11:59PM PDT we will be performing software upgrades on core routers in the Bay Area. No customer impact is expected, as traffic will be re-routed during the maintenance.

-Tim J.

Intermittent Performance – Legacy DSL.

Update (9:28AM): The cause of the performance issues has been located and a workaround put into place.

We are currently investigating an issue causing intermittent performance and connectivity problems for some legacy DSL customers. We will update this message once we have more information regarding the situation.

– Robbie and the NOC

Fusion/FlexLink Intrusive Maintenance

Beginning tonight at midnight I will be performing intrusive maintenance on equipment serving Fusion and FlexLink customers in Anaheim.  Expected downtime is around 1 hour.

-Brandon

 

Maintenance is taking a little longer than expected, so I am extending the estimated downtime by another hour.

-Brandon

 

Maintenance is now complete.  Thank you.

-Brandon

Server Maintenance

Tonight, May 11th, at 11:59pm, Operations staff will be performing minor work on the server cluster which runs the Forums, Member Tools and Webmail. These sites may be unavailable for a short period as each server is taken offline in turn. We expect the work to take no more than 1 hour.

– Dexter, Joe, and Sysops Team

Voicemail & MemberTools Outage

A regular automatic system update applied at 04:56AM this morning broke the Fusion Voicemail application services, including the internal RPC services used by MemberTools to check the status of a user’s Voicemail account.  A simple fix was applied at 8:28AM once the issues were brought to our attention.  During this period, calls to Voicemail resulted in a fast-busy and MemberTools logins would return a blank page.  Unfortunately, our monitoring did pick up the failure, but the alert wasn’t recognized as a service-impacting issue.

-Kelsey

Fusion/FlexLink Intrusive Maintenance

Beginning tonight at midnight I will be performing intrusive maintenance on equipment serving Fusion and FlexLink customers in the East Bay and a small subset of customers in Santa Ana.  Expected downtime is around 45 minutes.

-Brandon

 

Maintenance is now complete

-Brandon