Wed Jul 31 04:23:05 PDT 2002 — We brought the additional 75GB disk in typhoon live today, and gained another 60+ gigs of usable spool. This additional spool will help increase our binary retention, as well as reduce I/O contention on the other disks, improving overall performance. -Kelsey and Nathan
Wed Jul 31 14:05:31 PDT 2002 — Broadlink will be rebooting their head end switch at 11:00 PM tonight. This will result in a few minutes of interruption for broadlink customers. – Russ
Wed Jul 31 13:18:13 PDT 2002 — Our Pac West (530-xxx-0174) numbers started returning ‘All Circuits Busy’ messages about 30 minutes ago, and we’ve tracked it to an issue with the telco. Pac West’s engineers have not given us an ETR, but I will keep this space updated. — Eli, Stephanie
Update: The problem is more widespread than initially perceived, and affects all of our xxx-0174 numbers served from Stockton. This is a good portion of Northern California (excluding the Bay Area), and we’re working with Pac West to get this repaired as quickly as possible.
Update, 15:05hrs: The problem was caused by an administrative fumble at Pac West. Our backhaul circuit was mistaken for another, and disconnected. Service is fully restored, and we’re in deep dialogue with Pac West. — Eli
Mon Jul 29 21:00:35 PDT 2002 — We will be upgrading sonic.sonic.net, one of our core administrative servers, tonight with new faster CPUs and additional RAM. We are also going to be performing some non-intrusive reconfiguration in our core network. We may, time allowing, also try to bring up an additional 75GB of spool on news.sonic.net. We will start with sonic at Midnight. -Kelsey and Nathan
Fri Jul 26 09:42:00 PDT 2002 — One of our frame relay circuits is experiencing problems. This is affecting a few of our FR customers. We are working with PacBell to get this issue resolved and hope to have it resolved shortly. – Steve and Matt K.
Update: The frame relay T1 came back up by itself, most likely caused by something in PacBell’s network.
Thu Jul 25 13:40:09 PDT 2002 — Our SLB mail cluster is still exhibiting odd behavior. We are seeing poor NFS performance, and as a result, we are also seeing delayed email. It is also possible that we may be refusing inbound SMTP connections as the load on the servers climbs and ‘dirty’ processes pile up. We are currently investigating the cause of this problem and hope to have it resolved shortly. -Kelsey and Nathan
Update: The problem appears to have resolved itself, but we expect that in reality it will start up again tomorrow as load increases. We were not able to find anything wrong with our configuration and will continue to debug and troubleshoot once it starts again.
Update: The problem did not return today. We’ll continue to investigate the circumstances which cause our servers to get into trouble and to exhibit poor NFS performance. At this time we believe that the problems may have been caused by a remarkably aggressive Rumpelstiltskin attack (a brute-force dictionary probe of guessed recipient addresses).
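For the curious: a Rumpelstiltskin attack walks a dictionary of guessed usernames against a mail server, so one client generates a long run of unknown-recipient rejections. A minimal sketch of the kind of detection heuristic this suggests (the function name, log shape, and threshold here are hypothetical, not our actual tooling):

```python
from collections import defaultdict

# Assumed cutoff for flagging a client; a real deployment would tune
# this against normal traffic.
UNKNOWN_RCPT_THRESHOLD = 20

def flag_dictionary_attackers(rcpt_log):
    """rcpt_log: iterable of (client_ip, recipient, accepted) tuples,
    one per SMTP RCPT TO attempt. Returns the set of client IPs whose
    rejected-recipient count exceeds the threshold."""
    failures = defaultdict(int)
    for client_ip, _recipient, accepted in rcpt_log:
        if not accepted:
            failures[client_ip] += 1
    return {ip for ip, count in failures.items()
            if count > UNKNOWN_RCPT_THRESHOLD}
```

A client probing dozens of nonexistent mailboxes in one session would be flagged, while normal senders with the occasional typo would not.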
Tue Jul 23 13:23:48 PDT 2002 — shell.sonic.net (aka bolt) was not permitting dialup shell logins from our focal gear (numbers ending in 9811). After some investigation we found that the gear was on a new IP block that had not been authorized to rlogin into the shell server. We have added the IP block to the authorized IP list and all is working now. -Steve
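For reference, per-service authorization by source IP block on a shell host commonly works along these lines (a minimal sketch assuming TCP wrappers; the address block shown is a documentation placeholder, not our real addressing):

```
# /etc/hosts.allow -- services listed here accept connections.
# Permit rlogin only from the authorized dialup access gear.
in.rlogind : 192.0.2.0/255.255.255.0

# /etc/hosts.deny -- everything not matched above is refused.
in.rlogind : ALL
```

Adding a new IP block for freshly deployed access gear then amounts to appending it to the hosts.allow line.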
Mon Jul 22 16:44:59 PDT 2002 — Our SLB web and mail server clusters have been exhibiting some odd behavior today. For short periods of time, services become unavailable due to some network issue. We are working to track down this problem as quickly as possible. -Kelsey and Nathan
Update: We tracked the problems down to the EtherChannel links going through our Alteon AD3 hardware load balancer. We are unclear why this was causing problems, as it is a technology that we’ve been using for some time. We disabled all but one of the four trunked connections on both sides and everything is working fine.
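On the switch side, dropping the extra trunk members out of service looks roughly like this (a hypothetical IOS sketch; the interface names and port assignments are placeholders):

```
! Shut down three of the four EtherChannel member ports,
! leaving a single active link toward the Alteon AD3.
interface FastEthernet0/2
 shutdown
interface FastEthernet0/3
 shutdown
interface FastEthernet0/4
 shutdown
```

This sacrifices aggregate bandwidth and link redundancy, but takes the channel-hashing behavior out of the picture while we investigate.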
Sun Jul 21 10:08:30 PDT 2002 — ssl.sonic.net stopped responding to web requests a few minutes ago. A brief investigation revealed that the apache webserver was wedged up on NFS, most likely a result of the mass migration. A reboot resolved the problem but we’ll be keeping a close eye on it in case it’s something else. -Kelsey
Sun Jul 21 07:24:36 PDT 2002 — Night Operations Complete. We’ve made a massive move of our core storage architecture and associated servers. The migration went very well, and downtime of most services was quite a bit shorter than planned. Servers were taken offline at 1:30am. Web hosting was down for a bit over an hour. Mail took longer due to the backlog of inbound and outbound mail, and took about three hours to complete. Peak load average observed on one of the mail servers during this time was 992.99 (compare to a typical of under 5.00). FTP was down for about an hour and a half, later in the morning.
The clock on our primary administrative box came up with the incorrect time, and executed some scheduled tasks far out of schedule. This resulted in some invoices for colocation, disk usage and bandwidth usage being run when they shouldn’t have. If you have an email in your inbox regarding billing for one of these types of services this AM, please disregard! No actual charges were made.
We had some unexpected problems with one of our nameservers, and a few other minor challenges, but things look quite healthy now and we’re very pleased with how smoothly this transition went. This completes the majority of our move into our new datacenter facility, and we’re very excited.
If you should observe any odd behavior, please post to news:sonic.net, or email support@sonic.net, or call support at 707-547-3400 and explain what you’re observing. Ask support not to wake us if it’s not a critical item. =)
Thanks to the team here for all of their help! While I might have to ask forgiveness for missing someone, here’s a list of the folks who worked on this tonight. Ops: Nathan, Kelsey, Scott, Russ, Steve, Matt and myself. Techs: ChrisM, Jeff, Aaron, MattS, ScottB, ChrisB, Dan, Kavan, Bryan. Guest helpers: JenM, DustinM.
-Dane (very happy to be almost completely moved, and looking forward to my vacation which begins on Wednesday.)