Cascading IMAP/POP3 failures this morning

At 11am our back-end IMAP/POP3 cluster entered a critical state, which led to an interruption in those services as well as in other services that rely on our mail infrastructure. The initial cause of the failure was a routine maintenance procedure that involved dropping traffic to a portion of the cluster. The remaining servers should have been able to run temporarily as a smaller group, but that quickly turned out not to be the case. They began to fail intermittently as they tried to shift traffic to account for the sudden increase in load. This would have caused noticeable issues in mail clients, and it also interrupted service on both our webmail and our voicemail platforms.

As soon as the problem was detected, we aborted the maintenance and then added resources to the cluster to prevent further disruptions. As of 11:34am, service was fully restored. We do not expect any email to have been lost as a result. As always, we will look into improving our metrics and analytics so that we can respond faster.
