Fri Dec 29 17:51:23 PST 2000 — This afternoon we isolated and fixed the issue with the RADIUS code; the last few hours of monitoring show that it is working well. We will deploy the two new authentication servers next week after more development, but the existing configuration should serve well until the servers are ready. – Russ, Kelsey, Eli
Wed Dec 27 15:42:10 PST 2000 — We had a carrier transition on our T3 to UUNet, causing a few minutes of network instability while the Cable & Wireless T3 took up the load. The redundant configuration of our network prevented this from being a serious issue. – Eli
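For the curious, this kind of automatic carrier failover is what BGP multihoming buys: each T3 carries a BGP session, and when one carrier drops, routes converge onto the other. A minimal sketch in Cisco IOS syntax (the AS numbers and addresses here are hypothetical placeholders, not our production config):

    router bgp 64512
     ! announce our address space to both carriers
     network 203.0.113.0 mask 255.255.255.0
     ! session to UUNet over one T3
     neighbor 192.0.2.1 remote-as 701
     ! session to Cable & Wireless over the other T3
     neighbor 198.51.100.1 remote-as 3561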
Thu Dec 21 03:20:29 PST 2000 — Night Ops Complete: Our NetApps, freezer and icebox, are now configured in a cluster such that if a head unit fails, the other will seamlessly take over its NFS duties. Like the Alteon’s load balancing of all email, ftp and web, we now have redundancy on the NFS back-end. Thanks to Dan from NetApp, we have also thoroughly tested and proven that the cluster fail-over works properly.
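For those interested in the mechanics, the failover testing is driven from the filer console with Data ONTAP’s cluster-failover commands; roughly like this (output paraphrased, not a verbatim transcript):

    freezer> cf status
    Cluster enabled, icebox is up.
    freezer> cf takeover
    (freezer takes over icebox’s disks and serves NFS for both heads)
    freezer(takeover)> cf giveback
    (icebox resumes serving its own file systems)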
It should be noted that both NetApps had to be halted during the installation of the additional hardware needed to support clustering. During this time, inbound email queued locally on each mail server, but pop services were offline, along with web, ftp and shell. The total service outage extended from 12:15 AM to about 1:00 AM.
Our two new Cisco routers are on site and we’re preparing to migrate to an active-active dual-router configuration using Cisco’s HSRP protocol (similar to the standards-based VRRP). Once we have finished the migration to the new Ciscos, we will have full end-to-end redundancy in our core network and for all of our core services.
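A rough sketch of how HSRP presents a single gateway across the pair (Cisco IOS syntax; the addresses and group numbers are hypothetical): both routers are configured with the same virtual IP, the higher priority wins the election, and preempt lets a recovered router reclaim the active role.

    interface FastEthernet0/0
     ip address 192.0.2.2 255.255.255.0
     ! shared virtual gateway address; hosts use this as their default route
     standby 1 ip 192.0.2.1
     ! this router wins the election for group 1
     standby 1 priority 110
     standby 1 preempt

The second router carries the mirror image, with a lower priority in group 1; running a second group with the priorities reversed is what makes the pair active-active rather than active-standby.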
We also replaced our loaner Redback SMS 1000 with our new SMS 1800, which has much greater capacity for expansion than our old SMS 500. The SMS terminates all PacBell and Broadlink DSL service on our network.
Lily, the T3 MUX, had its primary controller restored (from the last night ops), so once again lily is internally redundant. We also completed some routine maintenance and reorganization of our NOC and some of our core servers.
-Kelsey, Steve, Nathan, Russ, Matt, Jared, Jeff, and the guys from NetApp.
Wed Dec 20 13:46:06 PST 2000 — We will be installing our new NetApp in a redundant configuration tonight; the NetApp engineers are here to help us get this up with minimal downtime. There may be a short interruption of local services between 12:00am and 1:30am tonight. We will also be upgrading our SMS 1000 to a new SMS 1800; this is our Redback DSL router, so DSL service will be interrupted for about 15 minutes at 1:30am tonight. Thanks -Kelsey, Steve, Jeff and the NetApp crew.
Tue Dec 19 09:20:03 PST 2000 — Our shell server was refusing connections this morning; restarting sshd fixed the problem. We will be keeping an eye on the box to make sure this does not happen again. -Steve and Matt
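Part of keeping an eye on it: a watchdog along these lines (a hypothetical cron entry, with a made-up restart-script path, sketched only for illustration) that probes the ssh port every few minutes and restarts sshd if it stops answering:

    # /etc/crontab (hypothetical): probe port 22, restart sshd if it is not answering
    */5 * * * * root nc -z -w 5 localhost 22 || /usr/local/etc/rc.d/sshd.sh restart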
Mon Dec 18 22:17:16 PST 2000 — Our 1003 dialup group started to return intermittent ‘All circuits are busy’ messages. Fixing it required a reboot of one of our NAS (network access server) boxes, which bumped a small number of users offline. After the reboot the error message went away. -Steve and Eli
Mon Dec 18 02:49:24 PST 2000 — Our T1 link to Sebastopol had problems around 2:30am this morning, and a power cycle of the CSU/DSU in Sebastopol was required to bring it back online. Remote power management over the dialup backup connection let us resolve it quickly without requiring a site visit. -Dane and Scott
Fri Dec 15 14:14:16 PST 2000 — The Redback SMS router that serves PacBell DSL and Broadlink WDSL customers was rebooted a few minutes ago. It appears that we’ve found a bug in the Redback’s firmware: a typographical error while configuring the machine caused it to crash, which is obviously not supposed to happen. We’ll be researching this with Redback. – Eli (who promises to type more slowly)
Thu Dec 14 12:20:44 PST 2000 — Bolt.sonic.net has been issued a shutdown. It has become unresponsive to console administration due to the dreaded ‘no more processes available’ (or ‘fork: Try again’) errors. These are caused by a large number of zombied and detached processes, primarily sshd, that are no longer active. We are looking into the problem and have been considering upgrading to OpenSSH now that the RSA patent has expired. Shell services should be restored in a few minutes. -Kelsey, Eli, and Steve.
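For the curious, the stuck processes are easy to spot once you know to look; something along these lines (standard ps, though the exact keywords vary a little by OS) lists and counts them:

    # list zombie processes; a STAT of 'Z' means dead but not yet reaped
    ps axo pid,stat,ucomm | awk '$2 ~ /^Z/'
    # a quick count
    ps axo stat | grep -c '^Z'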
Thu Dec 14 14:09:10 PST 2000 — We’ve had a brief service interruption, but normal service has been restored. During the shutdown of our shell server, one of our NetApp filers became unresponsive. This delayed mail services, as the cluster of mail servers was unable to access its mail spools. The unresponsive behavior of the filer has been traced to a bug in the NetApp’s OS, which we resolved with the help of NetApp’s engineers. We apologize for the delay in email delivery, and want to stress that no mail was lost during this outage. Furthermore, we had already scheduled deployment (again with the help of NetApp) of our dual NetApp filers in an ‘active-active’ redundant configuration, which would have mitigated this outage as well. — Eli, Kelsey, Steve, Matt (and everybody else who worked as fast as possible to resolve this)