Load Balancer Issues: We’ve just uncovered…

Tue Nov 4 14:36:42 PST 2003 — Load Balancer Issues: We’ve just discovered that one of our Alteon AD3 load-balancing switches is apparently corrupting Ethernet frames on at least one of its ports with single-bit errors. These errors were going completely undetected by the servers and switches; the corrupted frames carry correct checksum information.
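
To illustrate why a corrupted frame can still carry a valid checksum, here is a minimal Python sketch. It assumes (we have not confirmed this with the vendor) that the damage happens inside the forwarding device before the frame check is regenerated. The helper names `make_frame`, `frame_ok` and `flip_bit` are hypothetical, and the trailer is a simplified stand-in for the Ethernet frame check sequence, not a byte-accurate one.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a CRC-32 trailer (simplified stand-in for the Ethernet FCS)."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def frame_ok(frame: bytes) -> bool:
    """Recompute the CRC-32 over the payload and compare it to the trailer."""
    payload, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == trailer


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)


payload = b"Subject: quarterly report"
frame = make_frame(payload)

# Case 1: the bit flips on the wire, after the trailer was computed.
# The receiver's check fails, so the frame is dropped and retransmitted.
damaged_in_transit = flip_bit(frame, 9, 1)

# Case 2 (assumed scenario): the bit flips inside a device that then
# regenerates the trailer. The receiver's check passes on corrupted data.
damaged_then_resealed = make_frame(flip_bit(payload, 9, 1))

print(frame_ok(frame))                  # True
print(frame_ok(damaged_in_transit))     # False
print(frame_ok(damaged_then_resealed))  # True
```

In the second case nothing downstream can detect the change at the link layer, which matches what we observed: neither the servers nor the other switches logged any errors.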

The single-bit corruption in Ethernet frames on this switch was resulting in the substitution of characters in email streams sent to and from the affected servers. For example, the letter ‘A’ might have been changed to ‘C’, or ‘.’ to ‘n’, since each pair differs by exactly one bit. In most cases, the errors introduced would go unnoticed; they’d appear to be typos. However, attachments that were corrupted could be rendered unusable, and it’s also possible that errors at certain points, or those which introduced certain control characters, could have caused fatal errors.
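
For readers curious which substitutions a single flipped bit can produce, the following Python sketch lists every one-bit neighbour of a character. The function name `single_bit_flips` is hypothetical, and the sketch assumes single-byte (ASCII/Latin-1) text; it is an illustration, not a tool we used in the investigation.

```python
from typing import List


def single_bit_flips(ch: str) -> List[str]:
    """Return the 8 characters reachable from ch by inverting one bit.

    Index 0 flips the least significant bit, index 7 the most significant.
    """
    code = ord(ch)
    return [chr(code ^ (1 << bit)) for bit in range(8)]


for ch in "eJ":
    print(repr(ch), [repr(c) for c in single_bit_flips(ch)])
```

For ‘e’ the first seven results are ‘d’, ‘g’, ‘a’, ‘m’, ‘u’, ‘E’ and ‘%’, most of which would pass for typos. For ‘J’ one of the results is a newline, which is the kind of control character that could break a message outright.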

We are in contact with our vendor to determine whether the problem is a hardware or software fault in the switch. We’ve temporarily worked around the corruption by disabling the affected servers. Once we’ve gathered sufficient debugging information, we’ll swap to the standby Alteon, which is not exhibiting the problem, and re-enable the affected servers. -Kelsey and Nathan

Sendmail Upgrades: We made a small change to…

Mon Nov 3 15:08:51 PST 2003 — Sendmail Upgrades: We made a small change to the sendmail binaries in use on our mail cluster to resolve some infrequent STARTTLS-related errors. Normally, this upgrade would have gone without notice. However, the new binaries didn’t have the proper permissions set when they were installed. This didn’t affect normal email flow, but a small subset of users who use procmail to forward their mail to addresses outside our servers may find that their mail was not forwarded until the problem was noticed and corrected. -Kelsey and Eli

Router maintenance Tuesday, 11/4/03.

Sat Nov 1 16:34:56 PST 2003 — Router maintenance Tuesday, 11/4/03. We are scheduling the replacement of some router hardware at the Focal POP in SF at 5 am. We anticipate a 15-minute outage. This will affect dialup access through that POP and will cause some general routing instability as our network routes around the loss of, and then the reconnection to, UUNet. -John and Nathan

Graton Rooftop outage.

Fri Oct 31 14:26:20 PST 2003 — Graton Rooftop outage. A switch failure has taken down our wireless backhaul to the Graton Rooftop customer deployments. We are working to repair the problem, but we don’t have an ETA at this time. – Bryan, Eli

Update Fri Oct 31 14:45:45 PST 2003 — Services have been restored. The switch that serves our head-end deployment failed, and was replaced with an onsite spare. Cause of failure is not known. – Bryan, Eli

Local number calling problem.

Thu Oct 30 13:27:34 PST 2003 — Local number calling problem. We are experiencing problems with calls from Santa Rosa SBC phones to our main office number and our non-SBC dialup numbers. We believe this to be a problem with LNP, the local number portability system that the telephone companies use to direct calls from one carrier to another. We are opening trouble tickets with all the carriers involved. If you need to reach our office, you can use our Focal numbers: (707) 237-9616 Sales and Accounting; (707) 237-9617 Technical Support. -John and Russ

Update Thu Oct 30 16:13:30 PST 2003 — Our carriers report the problem solved. SBC shut down one of their switch routing servers (SS7 SCP) that was corrupted. The other three answer correctly and can carry the load. -John

Mail Delays: Freezer, the NetApp filer that…

Thu Oct 23 09:43:39 PDT 2003 — Mail Delays: Freezer, the NetApp filer that handles, among other things, ‘/home’, had a disk failure this morning. The filer behaved properly: it failed the disk and began to rebuild onto a spare. However, the failed disk continued to cause problems for the filer by triggering repeated bus resets. Meanwhile, the SpamAssassin cluster, which relies heavily on ‘/home’, suffered terribly. After removing the failed disk from Freezer, things are starting to get better, but it will still take some time before mail delivery returns to normal and all queued mail is delivered. At this time, we have disabled SpamAssassin in order to allow mail delivery to resume. -Kelsey

Update: 12:15:32 PDT — All services have been restored: Freezer has finished rebuilding its RAID, the SpamAssassin servers are back online, and all of the backlogged mail queues have been processed. -Kelsey, Nathan, Jared and Russ

Circuit Outage: Our 100mbit link from Santa…

Wed Oct 22 14:17:27 PDT 2003 — Circuit Outage: Our 100 Mbit link from Santa Rosa to Equinix, San Jose, has been shut down due to excessive errors and packet loss. We’re currently working with the vendor and hope to have it resolved shortly. At this time, reachability to the Internet in general should be fine, but users may experience some performance impact. This is especially true as one of our Cisco 7200VXR routers in San Francisco just (as I’m writing this MOTD) took the opportunity to crash with a red zone violation, further impacting the network. -Kelsey, Nathan and John.

Update 15:14:27 PDT — All service has returned to normal. We’ll continue to work with our vendors to resolve both issues encountered. -Kelsey

Mail Delay: Some routine work on one of our…

Tue Oct 21 12:22:29 PDT 2003 — Mail Delay: Some routine work on one of our NetApp NFS filers ended up becoming invasive and impacting performance. The SpamAssassin servers were affected the most, causing some mail to pass unfiltered. In order to restore the system, we shut down mail delivery for a few minutes to let the systems settle down. This only affected inbound email delivery, not outbound mail or users’ ability to check their mail. -Kelsey

Multicast and IPv6 outage.

Tue Oct 21 10:28:46 PDT 2003 — Multicast and IPv6 outage. The router that handles Sonic.net’s multicast and IPv6 traffic locked up this morning and required a power cycle to restore service. This event caused a multi-hour outage for these services. -John and Nathan