Fri Nov 1 16:00:38 PST 2002 — Ultra, one of our five load-balanced mail servers, entered an unusual failure mode in which it could no longer resolve the IP address of our outbound SMTP server cluster. 797 outbound email messages were returned to their local senders before we became aware of the problem and removed Ultra from the mail server pool. This appears to have affected only outbound email delivery from Ultra. We apologize for the problem and are working to make sure it does not recur. -Kelsey, Eli and Scott
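As a rough illustration of the kind of check that could catch this failure mode earlier, here is a minimal sketch (the hostname is a hypothetical placeholder, not our real cluster name) of a resolution test that a pool manager could run before keeping a host in rotation:

```python
import socket

# Hypothetical name; substitute the real outbound SMTP cluster hostname.
SMTP_CLUSTER = "smtp.example.net"

def can_resolve(hostname, port=25):
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, port)) > 0
    except socket.gaierror:
        return False

# A periodic health check could call can_resolve(SMTP_CLUSTER) on each
# mail server and pull any host that fails out of the pool before it
# starts bouncing customer mail.
```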
On Monday night at midnight we will be making
Fri Nov 1 14:42:28 PST 2002 — On Monday night at midnight we will be making a full test of our power generation facilities. While we certainly don’t have any reason to expect an interruption of power during this transition, it is possible. We plan to run the ISP on diesel for about 30 minutes during this test.
This will be the first full load test of the generator. Two previous partial load tests and periodic no load tests have gone well. Once the system is proven at full load, we will be doing periodic full load runs at least once per month during the daytime.
Sonic.net’s power generation system is a 24-liter, twin-turbocharged V-12 Detroit Diesel, which produces 1,024 horsepower and drives a generator rated at 750,000 watts. This is enough electricity to power a small town of about 750 homes – or one rather large ISP. A huge Liebert UPS array keeps us online during generator startup.
Tue Nov 5 10:32:16 PST 2002: Update; due to a scheduling difficulty, this test has been delayed until Tuesday at midnight.
Wed Nov 6 00:35:06 PST 2002: Update; as I write this, Sonic.net is running entirely on diesel power. The full transition test went smoothly, and all power generation and transfer systems operated as expected.
Our power generation plant can keep Sonic.net running indefinitely in the event of a utility failure. We have enough diesel on site currently to run for a week, and a fueling truck is scheduled to visit as often as we need. -Dane
BroadLink had a scheduled power outage at one
Fri Nov 1 14:23:47 PST 2002 — BroadLink had a scheduled power outage at one of their tower sites this morning, but the UPS system failed. They replaced the equipment quickly to get customers back online. -Dane
Update, Fri Nov 1 17:32:08 PST 2002: A second outage occurred and has been resolved. We expect at least one more once PG&E completes their work. As it’s both informative and funny, I’ll include an excerpt from the internal Sonic.net/BroadLink staff discussion list that explains the trouble. The following was written by BroadLink’s wonderful Jason Kane:
Regarding what happened:
As noted in the previous message, PG&E was putting up a new power pole across the street from the tower site. As a result, everyone in that area lost power for the day. It’s my understanding that wireline power will be restored in a few hours.
We originally believed that the scheduled power outage would not affect our customers, since we have battery backup and a generator to recharge it. The UPS failed immediately and the tower went dark. That was this morning.
To fix the problem we replaced the UPS with a mostly-charged unit, gassed up the generator, plugged our hardware into the UPS and plugged the UPS into the generator. Everything came back up, and we figured our only problem was making sure the generator had plenty of gas. But the universe decided that today would be a good day for tweaking with the otherwise idyllic lives of BL and its faithful customers. As you may be aware, the tower took a nose-dive about twenty minutes ago. We had Tim nearby, so he checked it out. The assumption was (of course) that the generator had run out of gas. But lo and behold, the generator was still cranking along without a hitch (a questionable metaphor, but you get the idea). The UPS, on the other hand, was in a world of its own. And in that world, restarting every few seconds is some sort of imperative.
When we unplug the UPS from the generator, it emerges from this malady and runs the tower off battery. But as every little gelfling knows, you can’t run on batteries forever, even if you’re a pink bunny. So we rework the setup and plug everything directly into the generator. Fingers crossed, the switches are flipped and the information age continues.
It is my belief that the power from the generator was not smooth/clean enough for the UPS. While it’s rather strange to have a UPS that’s more picky about clean juice than the switches, radios, management units and imported dancing hula girl lamps that are stuffed into the tower, we’re forced by irrevocable circumstance to continue living without an adequate answer to such questions.
Here’s the basic sequence of events:
8:20am – PG&E disconnects power to tower
8:23am – BroadLink battery backup system fails … outage
9:33am – battery system replaced, generator added
2:43pm – power supplied by generator kills replacement battery system … outage
3:33pm – tower rewired to run directly off generator, service restored
PG&E is supposed to finish their work today so we can rewire to run off primary power once again. We’ll test the battery system and replace it if needed to prevent a similar problem from recurring. We also now know that our generator can’t be used to re-charge our battery backups.
-Jason (BroadLink)
I hope that you found this as amusing as I did. -Dane
SpamAssassin and Graymail updates: We’ve just
Thu Oct 31 13:14:49 PST 2002 — SpamAssassin and Graymail updates: We’ve just completed upgrading SpamAssassin to the latest stable release and added a new feature to the GUI. Users can now optionally have our servers drop mail from blacklisted senders, so it ends up in neither your inbox nor your graymail. Please keep in mind that it does no good to blacklist an address the sender isn’t going to reuse; most spam is sent with uniquely generated From addresses. We have received overwhelmingly positive feedback on SpamAssassin, and over 30% of our members have turned it on. SpamAssassin is now enabled by default for all new accounts. We strongly recommend that anyone who hasn’t enabled SpamAssassin and graymail do so now. You can enable it in the member tools at sonic.sonic.net/membertools/spamcan -Kelsey and Chris B.
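For those running SpamAssassin themselves, per-user blacklisting looks something like the following in SpamAssassin’s own preference syntax (the addresses here are hypothetical examples, and this illustrates the stock config format rather than our GUI):

```
# ~/.spamassassin/user_prefs
blacklist_from  spammer@example.com      # block one exact address
blacklist_from  *@bulk-mailer.example    # block an entire domain with a wildcard
whitelist_from  friend@example.org       # never classify this sender as spam
```

As noted above, wildcard domain entries are generally more useful than exact addresses, since most spam forges a unique From address on each run.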
SMS reboot.
Sun Oct 27 09:05:20 PST 2002 — SMS reboot. Our Redback SMS decided to reboot itself this morning, causing about a 10-minute outage for DSL and FRATM T1s. When the SMS crashed, it wrote a crash dump to a flash card, which is why it took so long to come back up. We will be forwarding the crash dump to Redback for analysis. -Scott and Dane
Busies on 9811 dial-up numbers.
Fri Oct 25 08:23:51 PDT 2002 — Busies on 9811 dial-up numbers. We are currently experiencing busy signals on our Focal dial-up numbers in San Francisco. We are on the line with Focal and will work to resolve this quickly. Please note, we have other numbers available which can be found at www.sonic.net/cgi-bin/pops.pl -Matt
Update: Fri Oct 25 10:24:53 PDT 2002 — Focal dial-up issue resolved. Focal discovered the problem and we have worked together to resolve this issue. If you have further problems please call Tech Support to troubleshoot.
We believe that we have located the cause of…
Wed Oct 23 17:25:50 PDT 2002 — We believe that we have located the cause of the Alteon’s instability and will be reconfiguring it and two of our other switches at 1:00AM tomorrow morning. We don’t believe that it will cause any interruption of service; however, if it does become invasive, the downtime will be very brief. -Kelsey and Nathan
Update: Night Ops completed. We reconfigured the Alteon and the two other switches involved in our server load balancing. We believe this should resolve the ongoing stability issues. Incidentally, while taking the opportunity to continue deploying multicast on our core network, we encountered a software bug in one of our core Cisco 7507 routers that caused it to lock up. It continued to forward traffic until we rebooted it, which temporarily isolated our San Francisco POP from the rest of our network. This served as an opportunity to test the redundancy of our core network; apart from customers connected in San Francisco, no one should have noticed the event. -Kelsey, Nathan, Zeke and Matt
Night Ops Completed: We performed all of the…
Wed Oct 23 02:37:33 PDT 2002 — Night Ops Completed: We performed all of the planned maintenance without any customer impact. It appears that Cable and Wireless did not enable multicast routing on our upstream router, but multicast is tested and functional within our network as intended. Hopefully they will finish configuring our upstream router quickly so we can offer multicast streams to our DSL customers. -Nathan, Zeke and Kelsey
Software issue with Alteon.
Wed Oct 23 13:32:34 PDT 2002 — Software issue with Alteon. A failure of the link between our Alteon and a switch caused locally hosted web sites and email to be unavailable for 15 minutes. This time, rebooting the Alteon did not fix the issue. We are monitoring the device closely and continuing to work on a long-term redundant solution to avoid these types of problems in the future. -Matt, Kelsey, Eli and Scott
Busy signals on 707-823-8812.
Tue Oct 22 23:29:58 PDT 2002 — Busy signals on 707-823-8812. We discovered that a bad modem card in our Sebastopol dial group was causing fast busy signals during peak calling times. The card has been taken out of service for repair. We will add the new card when call volume is low. Customers should not notice any more problems with this number. -Matt and Russ