Update Sat Apr 19 00:30:38 PDT 2008 — ATM Switch Maintenance Complete. We have completed the replacement of the backup management card in our ATM switch and all went as expected. There should have been no customer impact, and traffic levels look normal on the switch. -Jared and Matt
Month: April 2008
ATM Switch Maintenance.
Fri Apr 18 11:29:19 PDT 2008 — ATM Switch Maintenance. Tonight at midnight, we will be replacing the management card in our ATM switch that started throwing errors on 4/2/08. This card is in a backup role on the switch, and replacing it should not cause any customer impact. However, in the event of a failure, DSL customers on this ATM switch may experience up to 5 minutes of downtime. Potentially affected customers are DSL customers in the greater Bay Area, Chico, Stockton, and San Diego areas. -Jared and Matt
News Server Outage.
Thu Apr 17 12:21:00 PDT 2008 — News Server Outage. news.sonic.net is currently offline due to back-to-back disk failures in the redundant, load-balanced pair of servers that handle news reader services. At this time it is unclear how long it will take to restore news reader services.
Failure One: Last night nnrp1 (the news reader slave server) was taken offline due to a disk failure. Users may have seen poor news performance if their streams were on this cluster until it was taken offline. The failed disk was replaced this morning with a spare, and work was started to rebuild it and bring it back into service.
Failure Two: The news server systems have removable faceplates that cover all of the drive bays. In order to confirm the correct configuration of the drives in nnrp1, the faceplate for nnrp0 was removed to expose the hot-swap disk carriers. At this point, nnrp0 (now the single, master news reader server) unexpectedly powered off. Upon reboot, one of its news-related disks has also hard failed.
We will continue efforts to restore news services as quickly as possible. In any case, it will take at least the remainder of the day to restore services, and it is likely that article numbering will be lost. -Kelsey
Update Thu Apr 17 15:31:16 PDT 2008 — We have scavenged parts from nnrp1 to rebuild nnrp0 and are currently restoring a backup that we hope will allow us to retain article numbering across all but our local sonic.* groups. Unfortunately, it is going to take several more hours before the data restoration is complete. If this fails, we will bring the server back up with all groups starting at zero. -Kelsey
Update Thu Apr 17 21:21:35 PDT 2008 — news.sonic.net is back online. We were able to restore nnrp0 with article numbering more or less intact but did lose numbering of our local groups, as expected. At this time our spoolers and extra header feeds are busy pushing their backlogged header feeds in. Time will tell how big a gap remains once they have caught up. Some users may find that they have to unsubscribe from and resubscribe to our local sonic.* news groups in order to see new news, or force their news readers to do a full refresh of the group’s headers. -Kelsey
Los Angeles DHCP Failure.
Sat Apr 12 09:08:51 PDT 2008 — Los Angeles DHCP Failure. One of our DHCP servers in our Los Angeles PoP suffered a failure at approximately 2:30 AM today that prevented it from providing DHCP leases. This problem would have interrupted connectivity for dynamic IP DSL subscribers in the Los Angeles area until it was resolved about 6 hours later, at 8:40 AM. We apologize for the outage and will be investigating the root cause of the problem our DHCP server experienced. -Jared and Tim
Custsql MySQL replication upgrade tonight.
Tue Apr 8 22:02:22 PDT 2008 — Custsql MySQL replication upgrade tonight. This evening, at 10:00 PM PDT, I will be performing this upgrade, which will make our system much more robust and add yet another layer of data protection to your databases! For a very short period, your sessions may time out; they will be back shortly. -Don
@Mail Webmail Service Interruption.
Mon Apr 7 23:29:19 PDT 2008 — @Mail Webmail Service Interruption. Earlier this evening some routine server maintenance caused @Mail to go down unexpectedly. We are sorry for the delays in returning it to service; it took an unusual quantity of sleuthing to find the errors introduced last week leading to the failure tonight. At this time @Mail is fully operational. -Kelsey, Don and Augie
Emergency ATM Switch Maintenance.
Update Thu Apr 3 02:22:47 PDT 2008 — Emergency ATM Switch Maintenance. When we performed our card reload earlier tonight, we apparently hit a bug which caused an interface to stop forwarding ATM cells while still appearing to function perfectly. This caused a loss of connectivity for some DSL subs in LATA1 with odd-numbered VPs. Upon investigating the bug, we found that the only way to resolve it was to reload the entire ATM switch. The reload took place at 2:15 AM and caused a 5-minute outage for all DSL subs connected to the switch. This includes a large portion of LATA1, as well as all of Chico, San Diego, and Stockton. Currently traffic levels look normal on all interfaces, and we will be closely monitoring this switch for the near future. -Jared and Nathan
ATM Switch Maintenance Complete.
Update Thu Apr 3 00:40:57 PDT 2008 — ATM Switch Maintenance Complete. We have completed our reload of the affected card and everything appears to be operating normally. There should have been no customer impact. -Jared
ATM Switch Maintenance.
Wed Apr 2 10:40:07 PDT 2008 — ATM Switch Maintenance. A backup management card in one of our ATM switches began experiencing errors this morning. Tonight at midnight we will be reloading the card to clear these errors. Since this card is the backup redundant management card, there should be no customer impact during the reload. -Jared and Nathan