This evening, at 10:00PM, we will be performing a software upgrade on the DSLAM that serves Fusion customers in the Rincon Valley area of Santa Rosa. Expected downtime is less than 15 minutes while the DSLAM is rebooted onto the new software release.
-Tim and Nathan
Things did not go as planned. The upgrade failed to import a wide variety of important customer settings, so we attempted our pre-scripted roll-back procedure to undo the software upgrade. That process went even worse and caused the DSLAM to forget a large chunk of even more important stuff. What was left was severely corrupted.
We keep a full library of historical device configurations, so the logical course of action was to re-program the device from one of those saved copies. This operation didn’t work. We suspected a version mismatch between the saved copies (they’re in binary; PLEASE, device vendors, don’t keep your configurations in binary!) and the exact software load we were attempting to restore onto. We tried four or five different combinations. Nothing worked.
Typically, our devices are provisioned by automated systems. Due to changes wedged into the new software release, our automated systems don’t quite know how to talk to it properly, so the automation was next to useless. In the end, we re-configured the whole device by hand on the release we were attempting to upgrade to.
Despite the saga above, this particular issue affected fewer than 20 of our customers. Our sincere apologies to those folks, who experienced an outage from around 10:30PM until 1:30AM or so. We’ll be hammering out these issues with our equipment vendor to ensure this doesn’t happen again.
-Nathan + Matt and Jared for moral support