Fri Dec 29 11:37:09 PST 2000 — Update on Napa dialup issue. PacBell re-seated the switch module in the Napa CO, and in turn our main Napa pop is back up and running. – Steve
Fri Dec 29 11:18:55 PST 2000 — We’re having ongoing problems this morning with our primary authentication server. About 20% of the time, it’s failing to authenticate customer logins. This affects dialup, mail and shell access, as well as web-based member tools. We are working to reduce the workload on the primary server in an attempt to resolve this problem, but haven’t had much luck. This issue has been slowly building for a couple of weeks, and is quite a bit worse today.
Kelsey, Eli, Steve and Russ are working to bring online two new authentication servers that have been in the works for some time. The two new machines will be much, much faster than the current configuration, and will be load balanced by the Alteon L4 switches for full redundancy. We’re hoping to wrap this up late this afternoon, but the final deployment may end up happening this weekend due to testing overhead.
If you have authentication failures, please simply try again. We’re sorry for the inconvenience this causes! If you find that after multiple attempts you still cannot access the service, please contact support ASAP at 707-547-3400 and let them know. -Dane
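As a rough illustration of why retrying helps, the sketch below takes the "about 20%" failure rate quoted above and computes the chance that several login attempts in a row all fail. It assumes each attempt fails independently of the others, which is a simplification and not something we have measured; the function name is hypothetical and only for illustration.

```python
def chance_all_attempts_fail(failure_rate: float, attempts: int) -> float:
    """Probability that every one of `attempts` logins fails.

    Assumes each attempt fails independently with the same rate
    (an assumption for illustration, not a measured property).
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return failure_rate ** attempts


if __name__ == "__main__":
    # 0.20 is the approximate failure rate reported in this entry.
    for n in range(1, 5):
        p = chance_all_attempts_fail(0.20, n)
        print(f"{n} attempt(s): {p:.2%} chance that all of them fail")
```

Under that assumption, three attempts in a row would all fail less than 1% of the time, so repeated failures most likely point to a different problem and are worth a call to support.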
Fri Dec 29 09:46:16 PST 2000 — PacBell reports that a switch module has gone down in the Napa CO, which is causing our main Napa dialup number to return busy signals. PacBell has assured me that this is getting top priority and should be fixed shortly. Meanwhile we do have redundant dialup access for Napa. You can find an alternate dialup number by checking our pop finder tool at www.sonic.net/cgi-bin/pops.pl or by calling tech support at 707-547-3400. – Steve
Fri Dec 29 08:39:43 PST 2000 — One of our radius servers was failing to authenticate mail requests this morning. After a quick restart it started working again. – Steve
Fri Dec 29 17:51:23 PST 2000 — This afternoon, we isolated the issue with the Radius code and fixed it; the last few hours of monitoring show that it is working well. We will deploy the two new authentication servers next week after more development, but the existing configuration should serve well until the servers are ready. – Russ, Kelsey, Eli
Wed Dec 27 15:42:10 PST 2000 — We had a carrier transition on our T3 to UUNet, causing a few minutes of network instability while the Cable & Wireless T3 took up the load. The redundant configuration of our network prevented this from being a serious issue. – Eli
Thu Dec 21 03:20:29 PST 2000 — Night Ops Complete: Our NetApps, freezer and icebox, are now configured in a cluster such that if a head unit fails, the other will seamlessly take over its NFS duties. Like the Alteon’s load balancing of all email, ftp and web, we now have redundancy on the NFS back-end. Thanks to Dan from NetApp, we have also thoroughly tested and proven that the cluster fail-over works properly.
It should be noted that both NetApps had to be halted during the installation of the additional hardware needed to support clustering. During this time, inbound email queued locally on each mail server, but pop services were offline, along with web, ftp and shell. The total service outage extended from 12:15 AM to about 1:00 AM.
Our two new Cisco routers are on site and we’re preparing to migrate to an active-active dual router configuration using Cisco’s HSRP (similar to VSRP) protocol. Once we have finished the migration to the new Ciscos we will have full end-to-end redundancy in our core network and for all of our core services.
We also replaced our Redback SMS 1000 loaner with our new SMS 1800, which has much greater capacity for expansion than our old SMS 500. The SMS terminates all PacBell and Broadlink DSL service on our network.
Lily, the T3 MUX, had its primary controller restored (from the last night ops), so once again Lily is internally redundant. We also completed some routine maintenance and reorganization of our NOC and some of our core servers.
-Kelsey, Steve, Nathan, Russ, Matt, Jared, Jeff, and the guys from NetApp.
Wed Dec 20 13:46:06 PST 2000 — We will be installing our new NetApp in a redundant configuration, and we have the NetApp engineers here to help us get this up with minimal downtime. There may be a short interruption of local services between 12:00am and 1:30am tonight. We will also be upgrading our SMS1000 to a new SMS1800. This is our Redback DSL router, so DSL service will be interrupted for about 15 minutes at 1:30am tonight. Thanks -Kelsey, Steve, Jeff and the NetApp crew.
Tue Dec 19 09:20:03 PST 2000 — Our shell server was refusing connections this morning; restarting ssh fixed the problem. We will be keeping an eye on the box to make sure this does not happen again. -Steve and Matt
Mon Dec 18 22:17:16 PST 2000 — Our 1003 dialup group started to return intermittent ‘All circuits are busy’ messages. Clearing this required a reboot of one of our NAS servers, which bumped a small number of people offline. After the reboot the error message went away. -Steve and Eli