
Backup power systems online

Update: Power was restored yesterday (14 April) around 5:55 PM, and all systems are normal.  –Augie

Sonic.net’s Santa Rosa datacenter and offices are currently running on our power backup system due to a PG&E utility outage.  All backup systems are operating as designed, and there is no customer impact.

It’s rare that we have an actual utility power outage here, so most of our use of the backup systems comes from weekly, semi-annual and annual testing and maintenance.  It is interesting to see the office darker than usual (only limited lighting is on) but otherwise functioning normally.  Technical support PCs are online, our phone system is online, and we are providing customer service as usual.

I’d like to offer thanks and congratulations to the team here who put together our backup systems: Russ Irving, Kelsey Cummings, Nathan Patrick, Juston Pierce and former employees Matt Kirk and John Harkin.  Thanks, team!  -Dane

AT&T ATM Outage

At approximately 2:30 AM today, AT&T lost ATM connectivity to the Santa Cruz area, causing all DSL in that area to be non-functional. AT&T has no ETR at this time, but we are in contact with them and will update as soon as they have any new information.

-Jared and Nathan

Update: AT&T has an official ETR on this outage of 8 PM today. We will continue to update as we get more information from AT&T on this outage. This news article provides additional information on the nature and cause of the AT&T outage: http://www.digitalnewsreport.com/2009/04/phone-internet-outage-in-santa-clara-santa-cruz/1302

Update: As of 2:40 PM we started seeing customers affected by the AT&T fiber cut come back online, and currently the live customer count is steadily increasing.

Update: At this time, AT&T reports that the fiber cut that caused this outage has been repaired, and all affected customers should be back online.

Intermittent DHCP Issue

At 11 AM today we proactively took out of service a DHCP server that serves some DSL customers in the Bay Area and Sacramento area, due to a RAID disk failure. DHCP traffic fell back to our backup DHCP server. Unbeknownst to us, the backup server had a hardware issue that caused it to respond slowly to DHCP lease requests, resulting in intermittent DHCP service for the affected customers. We have restored the primary DHCP server while we diagnose and repair the backup server. We apologize for any interruption of service that this issue caused.

-Jared, Nathan, Jasper, Kavan and Kim

Emergency Router Maintenance

At 12:00 PM today we will be performing an emergency router reload on one of our ATM customer aggregation routers. All connected Business-T and FRATM customers will experience approximately 5 minutes of downtime during the reload.

-Tim and Dusty

Fresno ATM Backhaul Issue

At 1:08 AM today, our ATM backhaul circuit to the Fresno area went down unexpectedly. We immediately began diagnosing and troubleshooting the circuit with AT&T, but the circuit came back up approximately 10 minutes later, before we could isolate the problem.

We are currently monitoring the circuit with AT&T, and apologize for any inconvenience this brief outage caused.

-Jared

PHP url_include removal

On Monday, March 30th, we will be disabling the url_include ability in our default PHP setup in order to improve the security of our web cluster. This ‘feature’ of PHP is frequently misused by web developers and is by far the most common vector used by hackers to exploit customer websites. Web hosting customers who require this functionality have several options to either work around or re-enable it, described in further detail in a FAQ at http://www.sonic.net/support/faq/advanced/url_include.shtml

If you think you may be using this feature, we urge you to review your PHP code before March 30th and make any necessary changes to ensure that you will not be affected.
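
For anyone unsure where to start with that review, here is a minimal sketch (not an official Sonic.net tool) of one way to scan PHP files for include or require statements that point at a remote URL. The function name find_remote_includes and the pattern are our own illustration, and the script assumes the URL appears as a literal string; includes built from variables or concatenation will not be found this way and still need a manual look.

```python
import re
import sys

# Matches include/require (and their _once variants) whose argument is a
# string literal starting with a remote scheme. Includes built from
# variables or string concatenation are NOT detected by this pattern.
REMOTE_INCLUDE = re.compile(
    r"""\b(?:include|require)(?:_once)?\b\s*\(?\s*['"](?:https?|ftp)://""",
    re.IGNORECASE,
)


def find_remote_includes(php_source):
    """Return (line_number, line_text) pairs that appear to include a remote URL."""
    hits = []
    for number, line in enumerate(php_source.splitlines(), start=1):
        if REMOTE_INCLUDE.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical usage: pass one or more PHP file paths on the command line.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, text in find_remote_includes(handle.read()):
                print(f"{path}:{number}: {text}")
```

Any line it reports is only a candidate: a match inside a comment is reported too, so treat the output as a list of places to check by hand rather than a verdict.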

Update: We have completed the changes to PHP on our web cluster. Please note that at this time these changes only affect customers using the default PHP configuration.

Emergency Router Maintenance

At 12:00 PM today we will be performing an emergency router reload on one of our ATM customer aggregation routers in Los Angeles. All connected Business-T and FRATM customers will experience approximately 5 minutes of downtime during the reload.
-Jared, Tim and Dusty

Router Software Upgrade

This Thursday, March 5 at 12:01 AM, we will be performing a software upgrade on one of our T1 customer aggregation routers in San Francisco. The software upgrade will require a reboot of the router, resulting in approximately 5 minutes of downtime for customers served by this router.
-Jared

Update: The software upgrade was completed without incident. Total downtime was 5 minutes for all customers served by this router.

Router Software Upgrade

This Tuesday, March 3 at 12:01 AM, we will be performing software upgrades on two of our T1 and WBA customer aggregation routers in Santa Rosa. The software upgrade will require a reboot of the routers, resulting in approximately 5 minutes of downtime for customers served by these routers. These routers serve T1 customers in the Santa Rosa area, and a small subset of our WBA customers.

-Jared

Update: The router software upgrades have been completed. Of the two routers being upgraded, one went smoothly, but the other had some issues properly loading the new software. This router was offline for approximately 20 minutes across 3 intervals while we worked to get the new software loaded. At this time, both routers are fully upgraded and appear to be operating normally. We apologize for the extra downtime incurred during this operation.

Web and MySQL failure

This morning around 6:30 AM one of the servers in our web cluster locked up and became unavailable. While we were en route to investigate the problem, two more servers in the cluster locked up and became unavailable. Customers may have noticed decreased performance from the cluster but should not have suffered any actual downtime. Unfortunately, during the restoration of service, one of our MySQL servers also went down due to human error: someone unplugged the wrong server. 🙁

It went something like this. 🙂 We apologize for any inconvenience and are trying to determine the cause to prevent future failures of this nature.

-William and the SOC