Month: November 2015

Email Storage Cluster Event, Planned Maintenance

One of the two pairs of storage clusters we use for email storage had a cluster takeover event at 11:09 AM today.  One of the heads lost communication with all of its disks, and its partner successfully took over all of its services without any interruption.  However, while we were tracing the failure of the first head to a failed FCAL optics package (which was replaced), the cluster interconnect adapter in the second head locked up and triggered a panic.  This panic led to a brief interruption of POP/IMAP services at approximately 12:00.  The second filer rebooted successfully and is still in partner takeover.

Unfortunately, since the second filer is still in takeover mode and can’t see the cluster interconnect adapter, the most conservative resolution requires that we halt the second filer and replace the failed adapter.  We’ve tentatively scheduled this for tomorrow after midnight, provided that we receive the replacement adapter from our vendor in time.  POP/IMAP services will be offline for the duration of the maintenance, which should take less than an hour to complete.

-Kelsey and William

Update 01:00: The cluster interconnect adapter has been replaced and all services have been fully restored.

-Kelsey and William

Fusion/FlexLink Intrusive Maintenance – Petaluma

Update (12:35AM): This maintenance is now complete.

Beginning tonight at midnight we will be performing intrusive maintenance on equipment serving Fusion and FlexLink customers in the Petaluma area. Expected downtime for these software upgrades is around 20 minutes.

– Robbie and Michael

Emergency Router Maintenance

Tonight, November 25, at 1:00am, we will be performing a maintenance reload of core network equipment serving our Santa Rosa data center. No service interruption is expected; however, in the worst case, customers may experience a brief period of routing instability towards colocation customers and Sonic services such as email and DNS.



Core Router Maintenance – Petaluma

Update (2:30AM): This maintenance is now complete.

Beginning tonight at midnight we will be performing maintenance on a core router in the Petaluma area. Customer traffic will be routed around the router during this operation.

– Robbie and Michael

outage.

Ops has observed and fixed a problem that would have affected users from 5:45pm to 8:50pm this evening. A networking issue between our VM cluster and the storage backend was to blame, and we are doing what we can to prevent an outage of this nature in the future.



Amazon Traffic Outage

This evening, November 18, starting around 5:00pm, Amazon began blackholing traffic towards the Sonic network. We attempted to route through a different upstream provider, but the routing issue appears to have been too deep inside the Amazon network for that to help. We believe they have fixed the routing issue, and all traffic has been restored as of 5:25pm.

Update: This issue appears to have resurfaced. We are reaching out to Amazon to determine the cause of the outage and will do everything we can to ensure it does not happen again.

-Tomoc and the NOC

Fusion/FlexLink Intrusive Maintenance – San Francisco

Update (3:20AM): This maintenance is now complete.

Beginning tonight at midnight, we will be performing maintenance on equipment serving a small portion of Fusion and FlexLink customers in the San Francisco area. This operation can cause downtime for up to a few hours in some cases.

– Robbie and Michael

Core Router Maintenance

Update (3:30AM): This maintenance is now complete.

Beginning tomorrow morning at 3:00AM, I will be performing maintenance on a core router in the San Jose area. There should be no downtime as a result of this operation; however, users may notice traffic and routing changes.

– Robbie

Server Upgrades

Update: Maintenance complete.

Tonight at 10pm, Operations will be upgrading one of the servers responsible for handling account and web-based services. During this time, Member Tools and Sign-ups will be unavailable. The expected downtime is 30 minutes.

— Joe @ System Operations Team

Scheduled System Maintenance

Update: Maintenance complete.

Tonight at 11pm we will be performing updates to several systems. Impacted customer-facing services will include mail and Member Tools.  Any interruptions to service should be brief, and the maintenance should be completed within 1 hour. Thank you.