Tue Apr 17 11:51:41 PDT 2001 — We just experienced an odd set of circumstances which caused outbound email from customers to be delayed. If you were trying to send email and found it to time out, please do send/receive again to dump the messages in your out-box. No email was lost. Downtime for actual transmission of email was about fourteen minutes.
Our four primary mail servers, which handle SMTP (outbound) and POP (inbound) mail, sit behind a load-balancing switch designed to keep any single point of failure from affecting end users. However, the reboot of a secondary nameserver exposed a set of conditions that could still trigger a failure. Each mail server uses at least two nameservers, but the primary one on all four mail servers was rns2.sonic.net, 208.201.224.33. While that system was undergoing maintenance, all four mail servers fell back to their secondary, rns1.sonic.net, 208.201.224.11, for reverse DNS lookups, but each new connection took an extra 30 seconds to fall back. This delay caused the Alteon load-balancing switch to mark each mail server as unresponsive. With all four running slow because their primary DNS server was down, the Alteon effectively shut down SMTP services.
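For readers who want to see the failure mode laid out, here is a rough model of it in Python. Only the two nameserver addresses and the 30-second fallback delay come from the incident; the server names (mail1 through mail4), the health-check timeout, and the normal response time are assumptions made up for illustration, and this is not how the Alteon is actually configured.

```python
# Rough model of the outage. Values marked "assumed" are illustrative guesses.
FALLBACK_DELAY_S = 30.0   # from the incident: extra delay per new connection
HEALTH_TIMEOUT_S = 10.0   # assumed health-check timeout, not the real Alteon setting
BASE_LATENCY_S = 0.5      # assumed normal time to answer a new connection

RNS1 = "208.201.224.11"   # rns1.sonic.net
RNS2 = "208.201.224.33"   # rns2.sonic.net


def connection_latency(resolvers, down):
    """Seconds before a mail server answers, given its resolver order."""
    latency = BASE_LATENCY_S
    for address in resolvers:
        if address not in down:
            return latency
        latency += FALLBACK_DELAY_S  # wait out the dead resolver, then fall back
    return float("inf")  # every resolver is down


def healthy_servers(pool, down):
    """Names of the servers that answer within the health-check timeout."""
    return [name for name, resolvers in pool.items()
            if connection_latency(resolvers, down) <= HEALTH_TIMEOUT_S]


# Hypothetical server names; all four list rns2 first, as during the outage.
outage_pool = {f"mail{i}": [RNS2, RNS1] for i in range(1, 5)}

print(healthy_servers(outage_pool, down=set()))   # all four pass
print(healthy_servers(outage_pool, down={RNS2}))  # none pass
```

Under these assumptions, taking rns2 out of service pushes every server past the timeout at once, so the switch has nothing left to send SMTP traffic to.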
To prevent this in the future, the four email servers now use different primary and secondary DNS search orders, so no single nameserver is the primary for all of them. We’ve also asked Alteon to change the health monitoring so that, if every server for a particular service is slated to be removed from service, the switch performs more lenient health checks on them to see whether they’re just running slow.
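The two changes can be sketched the same way. This is a simplified illustration, not the actual server configuration or an existing Alteon feature: the server names, which server gets which search order, and both timeout values are assumptions, and the lenient check only approximates the behavior we asked for.

```python
# Sketch of both mitigations. Values marked "assumed" are illustrative guesses.
FALLBACK_DELAY_S = 30.0    # from the incident: extra delay per new connection
STRICT_TIMEOUT_S = 10.0    # assumed normal health-check timeout
LENIENT_TIMEOUT_S = 60.0   # assumed relaxed timeout for the fallback check
BASE_LATENCY_S = 0.5       # assumed normal time to answer a new connection

RNS1 = "208.201.224.11"    # rns1.sonic.net
RNS2 = "208.201.224.33"    # rns2.sonic.net


def connection_latency(resolvers, down):
    """Seconds before a mail server answers, given its resolver order."""
    latency = BASE_LATENCY_S
    for address in resolvers:
        if address not in down:
            return latency
        latency += FALLBACK_DELAY_S
    return float("inf")


def in_service(pool, down):
    """Servers kept in service: strict check first, lenient if none pass."""
    latency = {name: connection_latency(r, down) for name, r in pool.items()}
    strict = [n for n, s in latency.items() if s <= STRICT_TIMEOUT_S]
    if strict:
        return strict
    # Every server failed the strict check, so test whether they are only slow.
    return [n for n, s in latency.items() if s <= LENIENT_TIMEOUT_S]


# Hypothetical assignment: half the servers try rns1 first, half try rns2 first.
staggered_pool = {
    "mail1": [RNS1, RNS2],
    "mail2": [RNS2, RNS1],
    "mail3": [RNS1, RNS2],
    "mail4": [RNS2, RNS1],
}
# The old layout, kept to show what the lenient check alone would do.
old_pool = {name: [RNS2, RNS1] for name in staggered_pool}

print(in_service(staggered_pool, down={RNS2}))  # mail1 and mail3 stay up
print(in_service(old_pool, down={RNS2}))        # all four kept, though slow
```

With the staggered orders, losing one nameserver slows only the servers that try it first, and the rest keep passing the strict check. The lenient check is a backstop for the case where every server is slow at the same time.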
-Dane, Russ, Kelsey, Scott and Eli