Month: May 2015

System Maintenance Tonight

Update: Maintenance complete

Tonight, starting at 11:59pm, SOC will be updating software on some of our core systems. The following services may experience brief interruptions:

  • Website hosting
  • IPv6 tunnels
  • Incoming and outgoing mail

We will also be upgrading the SSL certificate for imap.sonic.net from SHA1 to SHA256. This is the last of our SSL certificates that we need to upgrade, so we don’t expect most clients to have problems, but very old mail clients may not support the new certificate.
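
For anyone who wants to see which signature hash a server's certificate uses, the following is a minimal sketch rather than an official tool. It assumes the mail server accepts TLS connections on port 993 (the usual IMAPS port, which this notice does not state). It uses a crude heuristic: it searches the raw certificate bytes for the DER encoding of two well-known signature algorithm identifiers instead of fully parsing the certificate. The function names are illustrative.

```python
import ssl

# DER encodings (tag 0x06, length 0x09, body) of two common
# signature-algorithm object identifiers.
SIGNATURE_OIDS = {
    bytes.fromhex("06092a864886f70d010105"): "sha1WithRSAEncryption",
    bytes.fromhex("06092a864886f70d01010b"): "sha256WithRSAEncryption",
}


def signature_algorithm(der_cert: bytes) -> str:
    """Return the name of the first known signature OID found in the bytes.

    This is a byte search, not an ASN.1 parse, so treat the answer as a hint.
    """
    hits = [(der_cert.find(oid), name) for oid, name in SIGNATURE_OIDS.items()]
    hits = [(pos, name) for pos, name in hits if pos != -1]
    return min(hits)[1] if hits else "unknown"


def fetch_der_cert(host: str, port: int = 993) -> bytes:
    """Fetch the server certificate over TLS and return it in DER form.

    Port 993 is an assumption (standard IMAPS); adjust it if needed.
    """
    pem = ssl.get_server_certificate((host, port))
    return ssl.PEM_cert_to_DER_cert(pem)


# Example (requires network access):
#   print(signature_algorithm(fetch_der_cert("imap.sonic.net")))
```

A result of "unknown" only means the certificate uses an algorithm outside this two-entry table, such as an ECDSA signature.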

 

-Grant, Joe, and SOC

Fusion VDSL2 Intrusive Maintenance – Forestville

Update: This maintenance is complete.

Beginning tonight at midnight, I will be performing intrusive maintenance on equipment serving a small portion of Fusion customers in the Forestville area. Expected downtime is around 15 minutes.

– Robbie

Credit card processor down

UPDATE: Our vendor got back to us and we now have the problem resolved.

Currently our credit card processor is down and we are unable to process new payments. We have already contacted our vendor but unfortunately we do not expect to have a resolution until early tomorrow morning.

-William

Intrusive Network Maintenance – Brentwood

Tonight, beginning at 11:59PM PDT, we will be performing a software upgrade of equipment serving the Brentwood/Pittsburg/Antioch/Concord areas. This maintenance is expected to last 30-45 minutes and may be service-impacting for the duration.

-Tim J.

Network Maintenance – Legacy DSL

Update (2:22AM): This maintenance is now complete.

Beginning tonight at midnight, I will be performing maintenance on equipment that serves legacy DSL customers in northern California. Although the majority of the equipment I will be working with is redundant, a small portion of customers may experience some downtime.

– Robbie

UPS Failure Redux

First, we’d like to clarify the extent of the problems caused by the UPS failure and the subsequent dropping of load in the datacenter. This had no impact on any residential or enterprise connectivity services, including Legacy DSL, Fusion, and Fusion FTTN. The UPS that failed was the smallest of the three UPSes in Santa Rosa, and we had been working to migrate load from it. As a result, fewer than 20 customers in total lost some or all of their power circuits, some of which may have been part of redundant A/B circuits. Some colo customers lost connectivity because several distribution switches lost power. Most Sonic services, including POP, IMAP, and webmail, were not affected or saw only a brief outage as single-PSU equipment rebooted and/or clusters converged as load shifted to unaffected systems. The only public service with lingering issues was our webhosting cluster, which required a little manual attention to come back online.

The outage was ultimately caused by a physical failure of the maintenance bypass switch in the bypass cabinet for the PDU we were moving: one of the phases in the switch stuck and/or didn’t close correctly. In hindsight, it is unfortunate that we chose to operate the switch in the first place, as it wasn’t strictly the simplest way to migrate the load. The last power failure in the datacenter was in Oct ’04, when the same UPS failed.

We will schedule the migration off the temporary feeds put in place in the coming weeks. This final move is significantly easier to execute and has an exceedingly low likelihood of causing any service interruptions.

-Kelsey, Russ, and the rest of System and Network Operations.

 

Intermittent DNS failure

Between 5:00 pm yesterday and 9:00 am today, customers may have experienced intermittent DNS failures or slower-than-normal name resolution. At 9:00 am this morning we noticed a configuration failure on one of our name server clusters. We immediately disabled the cluster, which allowed traffic to fail over to our other, redundant cluster. We have since addressed the issue and restored the cluster to service. We are currently reviewing our monitoring procedures to identify why this issue wasn’t detected earlier and to make sure it doesn’t happen again. We apologize for any inconvenience this may have caused.
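
As an illustration of the kind of check that can catch intermittent resolution problems, here is a minimal sketch. It is not a description of our actual monitoring, and every name and threshold in it is hypothetical. It repeats a lookup several times, times each attempt, and reports the fraction of attempts that failed or were slow. A single probe can easily miss a fault that affects only some queries, which is why the lookup is repeated.

```python
import socket
import time


def probe(resolve, name, slow_after=1.0):
    """Time one lookup and classify it as 'ok', 'slow', or 'failed'."""
    start = time.monotonic()
    try:
        resolve(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return "failed"
    elapsed = time.monotonic() - start
    return "slow" if elapsed > slow_after else "ok"


def failure_rate(resolve, name, attempts=10, slow_after=1.0):
    """Return the fraction of attempts that were not 'ok'."""
    results = [probe(resolve, name, slow_after) for _ in range(attempts)]
    return sum(r != "ok" for r in results) / attempts


def system_resolve(name):
    """Resolve a name through the operating system's configured resolvers."""
    return socket.getaddrinfo(name, None)


# Example (requires network access; the 0.2 alert threshold is arbitrary):
#   if failure_rate(system_resolve, "example.com") > 0.2:
#       print("DNS resolution looks degraded")
```

Passing the resolver in as a function keeps the timing and counting logic separate from the lookup itself, so the same check can be pointed at a different resolver or exercised with a simulated one.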

– William & Kelsey

UPS Failure in Santa Rosa Datacenter

One of the three UPSes that handle load in our Santa Rosa datacenter failed early this morning and tripped into bypass. Unfortunately, the internal failure is significant and involves at least the primary IGBTs. We are exploring our repair options, but the most likely outcome is that we will accelerate the planned decommissioning of this UPS and the migration of its associated PDU to one of our other two UPSes. This is something that we had planned to complete at some point in the next six to twelve months but have not yet scheduled or scripted. It is a relatively straightforward procedure but must be executed with great care to ensure both the safety of our workers and that live load in the datacenter is not dropped. Updates will be posted as needed.

Current status: Our standby generator is currently running to enable the ATS to transfer load without interruption in the event that our primary PG&E power feed drops.

Update: Friday 14:00, we have electricians on site placing the cable to move the PDU from the failed UPS to one of our other UPSes. We plan to complete the migration as soon as the cable is staged and ready to go. Once the cable is placed, the new target UPS will be placed into maintenance bypass. This allows us to transition the PDU from the old bypassed UPS to the new UPS without dropping its load. Once the cable is terminated and the breaker on the target UPS is closed, the old breaker can be opened, completing the transition. At this point, the target UPS will be restarted.

Update: Friday 15:05, we’re beginning the bypass procedure now.

Update: Friday 15:15, unfortunately, load on the PDU was dropped momentarily, but we are continuing with the migration. Power was lost to several of our single-PSU systems, but most affected services have already been restored. More information forthcoming.

-Kelsey and Russ