Month: August 2000

We have had a system failure that led to lost

Thu Aug 24 13:20:49 PDT 2000 — We have had a system failure that led to lost customer email. Beginning at 9:28:53 AM Thu Aug 24 2000 and ending at 12:15:00 PM, three out of our redundant array of five email servers were misconfigured, and silently discarded customer email. Lost messages were not bounced back to the sender, and cannot be recovered. We are shocked by this event, and have eliminated the configuration which caused this. We will be informing all senders and all recipients of each lost email message with full information including the date, time, sender and recipient of each message so that they can be re-sent.

We sincerely apologize for the error, and the inconvenience that it causes.

In addition to eliminating the configuration setup that caused this, we’re committed to deploying a testing engine which proactively sends and receives email through each of the mail servers individually on an ongoing basis. By testing each and every server independently every few minutes, both locally at Sonic.net and from off site, we can stay better informed about any possible loss or delay which might affect our customers’ email delivery. This is similar to our existing ‘ckhosts’ and ‘ckdisk’ tools, which currently check for availability of web, mail, ftp, and Usenet news services and for disk space.

These tools page our staff of system administrators with text messages any time a server becomes unresponsive or unavailable, and this allows us to deliver high availability for your Internet services. We will follow up here in the MOTD once this testing engine is deployed.
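For the curious, the per-server check described above might look roughly like the sketch below. This is only an illustration under stated assumptions, not the actual testing engine: the names `probe_server` and `failing_servers`, the `send` and `was_received` callables, and the timeout values are all hypothetical. In practice `send` would hand a tagged message to one specific mail server (for example with Python's `smtplib`), and `was_received` would check a test mailbox for that tag.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ProbeResult:
    server: str       # the mail server the test message was sent through
    delivered: bool   # True if the message showed up before the timeout
    seconds: float    # how long we waited


def probe_server(
    server: str,
    send: Callable[[str, str], None],      # hypothetical: send(server, token)
    was_received: Callable[[str], bool],   # hypothetical: was_received(token)
    timeout: float = 60.0,
    poll: float = 1.0,
) -> ProbeResult:
    """Send one uniquely tagged message through a single server and wait for it."""
    token = uuid.uuid4().hex               # unique tag so probes never get confused
    start = time.monotonic()
    deadline = start + timeout
    send(server, token)
    while True:
        if was_received(token):
            return ProbeResult(server, True, time.monotonic() - start)
        if time.monotonic() >= deadline:
            # Nothing arrived: the server may be silently discarding mail.
            return ProbeResult(server, False, time.monotonic() - start)
        time.sleep(poll)


def failing_servers(
    servers: Sequence[str],
    send: Callable[[str, str], None],
    was_received: Callable[[str], bool],
    timeout: float = 60.0,
    poll: float = 1.0,
) -> List[str]:
    """Probe every server on its own and return the ones that lost the message."""
    results = [probe_server(s, send, was_received, timeout, poll) for s in servers]
    return [r.server for r in results if not r.delivered]
```

The point of probing each server separately is that a healthy majority cannot mask a broken minority: any server named in the returned list would be a reason to page someone.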

If you suspect that you may have lost email, please stand by for a notification. -Dane, Scott and Kelsey

Just to let everyone know, we’ve increased…

Wed Aug 23 17:17:12 PDT 2000 — Just to let everyone know, we’ve increased the maximum email message size from 7MB to 15MB due to continued demand to up the size limit. We also increased the user file size limit on bolt from 24MB to 50MB. -Steve, Kelsey

SOLVED: Routing loop on Cable & Wireless.

Sun Aug 20 04:46:16 PDT 2000 — SOLVED: Routing loop on Cable & Wireless. CW has completed their emergency maintenance in San Francisco. Apparently the problem snowballed when they tried to replace an interface card that wasn’t supported by their router’s IOS. (Or something like that — details for their 5-hour outage are vague.) The CW T3 is up now, and I am seeing normal traffic loads. -Scott

SOLVED: Packet loss and latency on UUNet.

Sun Aug 20 00:41:57 PDT 2000 — SOLVED: Packet loss and latency on UUNet. (whew!) UUNet now says this was an ATM interface that was having problems. Quote: ‘this is a known issue with Cisco ATM interfaces’. (Oh yeah? Why did it take 20 hours to fix?) Anyway, we now know the magic words to use should it happen again. Additionally, throughout the day, I’ve been writing a packet loss and latency monitoring tool that will automatically notify us of another problem like this. Called ‘See’, the tool keeps a weather eye out for problems with any of our NSPs — a kind of minimal ‘Internet weather report.’ Please visit news:sonic.net for more information. The Cable & Wireless T3 is still down, as we are awaiting resolution of the problems with their network in San Francisco. -Scott
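As a rough illustration of the kind of check a packet loss and latency monitor performs, here is a minimal sketch. It is not the ‘See’ tool itself, whose internals are not described here. It assumes round-trip times have already been collected per provider (with `None` standing for a probe that got no reply), and the function names and thresholds are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional, Sequence, Tuple


def summarize(rtts_ms: Sequence[Optional[float]]) -> Tuple[float, Optional[float]]:
    """Return (loss fraction, average latency in ms) for one batch of probes.

    Each entry is a round-trip time in milliseconds, or None for a lost probe.
    """
    if not rtts_ms:
        raise ValueError("need at least one probe")
    answered = [r for r in rtts_ms if r is not None]
    loss = 1.0 - len(answered) / len(rtts_ms)
    latency = mean(answered) if answered else None  # None: nothing came back
    return loss, latency


def providers_in_trouble(
    samples: Dict[str, Sequence[Optional[float]]],
    max_loss: float = 0.05,          # hypothetical threshold: 5% loss
    max_latency_ms: float = 200.0,   # hypothetical threshold: 200 ms average
) -> List[str]:
    """Name every provider whose loss or latency exceeds the thresholds."""
    flagged = []
    for name, rtts in samples.items():
        loss, latency = summarize(rtts)
        if loss > max_loss or latency is None or latency > max_latency_ms:
            flagged.append(name)
    return flagged
```

Measuring loss and latency together matters because the two can be seen differently from each end of a link: a provider may report latency only while probes from the customer side show both.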

Routing loop on Cable and Wireless.

Sat Aug 19 23:33:00 PDT 2000 — Routing loop on Cable and Wireless. I guess I spoke too soon. I should have said ‘rest assured we won’t be bringing up the T3 until UUNet fixes their network UNLESS Cable and Wireless dies a miserable and savage death.’ Which they did: their San Francisco network just lost connectivity to the Internet. Just got off the phone with Cable & Wireless: they are conducting emergency maintenance in San Francisco. Man, when it rains, it pours… The UUNet T3 is back up (latency and all) and we will be shifting back as soon as CW has completed their maintenance. -Scott

Packet loss and latency on UUNet.

Sat Aug 19 22:33:16 PDT 2000 — Packet loss and latency on UUNet. The gentleman at UUNet tech support informed me that he saw latency, but no packet loss. I had to bring up the T3, demonstrate that there was both latency and packet loss, and then shut it down again. Each time we shut down a T3, we experience intermittent loss of connectivity to some sites while routes converge on the remaining T3s. I apologize if you noticed a hiccup in the Internet tonight, and rest assured we won’t be bringing up the T3 until UUNet fixes their network. -Scott

Packet loss and latency on UUNet.

Sat Aug 19 21:49:41 PDT 2000 — Packet loss and latency on UUNet. Still no resolution, and indeed, the problem appears to have gotten worse. We’ve shut down the UUNet T3 pending a resolution. Meanwhile, I was able to get around the full mailbox situation, and I’m on the phone with them now while their tech support gets ahold of their NOC. -Scott

Packet loss and latency on UUNet.

Sat Aug 19 20:29:37 PDT 2000 — Packet loss and latency on UUNet. Still no resolution, and calls to UUNet’s multi-megabit support line have been dumped into a voice mailbox that has been full for the last few hours. I was able to contact someone via a back channel, but we can’t estimate how long this is going to take — methinks UUNet has seriously dropped the ball on this one. -Scott

Packet loss and latency on UUNet.

Sat Aug 19 16:08:20 PDT 2000 — Packet loss and latency on UUNet. UUNet has (finally!) identified the problem: a switch card has lost its mind. They have already dispatched someone to replace it, and we should see resolution shortly. -Scott and Dane

Packet loss and latency on UUNet.

Sat Aug 19 15:07:11 PDT 2000 — Packet loss and latency on UUNet. Dane and I have been tag-teaming UUNet since early this morning to solve a problem with packet loss and high latency to some other UUNet customers. We are starting to growl at them, as over 12 hours is just too long for this type of problem. Shutting down the UUNet T3 (and running through Cable & Wireless) won’t help the situation, as we see the same high latency and loss when going through CW. Updates as they occur… -Scott and Dane