DSL Performance Problems in LATA1

This morning at about 9:30 AM we began tracking a problem that was causing some DSL customers in the Bay Area to experience poor speeds and performance. We have tracked the cause of the problem back to a failing ATM card in one of our core routers, and have removed that card from service. At this time, all DSL should be functioning normally, but we continue to monitor the situation.

-Jared and Nathan

9 comments for “DSL Performance Problems in LATA1”

  1. Dear Jared and Nathan,

    This morning, about 9:40AM PST, my home business stopped when Sonic.net delivered almost no bandwidth over the dsl line.

    OK, this sort of thing happens. After running diagnostics at my home, I called Sonic.net support line.

    After 10 minutes of music-on-hold, I got through to Mark at 10:05 AM PST. He walked me through the usual suspects, and told me repeatedly that all his diagnostics indicated a completely normal connection between Sonic.net and my computer.

    Mark asked me to remove my router and connect my computer directly to the modem; I did so, with the same result. I ran an ethernet cable to my daughter’s bedroom to test her computer – same result.

    20 minutes later, I called back, and listened to music on hold for another 35 minutes.

    Suddenly, the DSL line came back – at 11:35AM. Soon after the problem was solved, the posting on Sonic.net status page reported that the problem had been fixed. Yippee!

    BUT WAIT! SHOULDN’T THE PROBLEM HAVE BEEN POSTED TO THE STATUS PAGE THE MOMENT IT WAS DISCOVERED? Worse, Mark and the rest of the support staff were unaware of this problem while your staff was working to solve it. As a result, I wasted an hour dragging ethernet cables across my house.

    PLEASE – DON’T WAIT UNTIL THE PROBLEM IS SOLVED TO POST A NOTICE THAT THERE’S A PROBLEM.

    Suggestion 1) Even if you cannot fix something immediately, it’s important to post a notice (with date & time) to your web page to say that you’re at least aware of the problem.

    Suggestion 2) Don’t keep these things a secret from your support staff.

    Suggestion 3) When there’s an outage, please replace the music-on-hold with details of what’s wrong.

    Suggestion 4) Include the time zone of each posting — clearly your posting of November 10, 2009 – 1:57pm is not sonic.net’s home time zone. Maybe it’s Eastern Daylight Time.

    -Cliff Stoll (in Oakland)

    > DSL Performance Problems in LATA1
    > November 10, 2009 – 1:57 pm by jared
    >
    > This morning at about 9:30 AM we began tracking a problem that was causing some DSL customers in the Bay Area
    > to experience poor speeds and performance.
    > We have tracked the cause of the problem back to a failing ATM card in one of our core routers,
    > and have removed that card from service. At this time, all DSL should be functioning normally,
    > but we continue to monitor the situation.
    >
    > -Jared and Nathan

  2. Cliff,

    I appreciate your feedback, and apologize for the late communication on this issue. Let me address a few of your concerns:

    First, our support staff was definitely not in the dark about this issue. They are one of our first lines of alert for strange problems like this. They were the ones who correlated the high volume of calls to a widespread performance issue and narrowed it down to a particular region in our network. If you were one of the first people to call in on this issue, they would have had to troubleshoot your problem in the usual manner in order to determine whether or not you were part of a larger pattern.

    Secondly, we do try to post an MOTD of some sort, and put a “red alert” message on our phone system in the event of problems like this. I’m not positive why the red alert was not put up, but I can say that the majority of the people who are involved in such a thing were heavily committed to tracking down and fixing this problem.

    Finally, the timestamp on the MOTD is incorrect due to the timezone being improperly set on our MOTD server. This should be fixed shortly. In general, all of our publicized times and dates should be Pacific time.

    -Jared

  3. Thanks for your note, Jared.

    The red alert seemed not to happen today … the first mention in the MOTD is your note announcing that the problem has been fixed.

    Another suggestion: Please consider mirroring sonic.net status reports and message-of-the-day notes to a mirror URL that’s far from sonic.net.

    Of course, everyone trusts that sonic.net will never be disconnected from the net. But in the infinitesimal chance that this happens, it would be extremely useful to have a pre-arranged offsite URL to find details and recovery progress.

    Thanks!
    -Cliff

  4. I called this morning about this, and speed is still very erratic according to dslreports.com speed tests I’m running. I have ‘pro’ level DSL, and after I’d upgraded to it a week ago, I was consistently getting 2.5Mbps. Now it will jump up to 1.5Mbps (my old, pre-pro speed) then down to 0.4Mbps, and all around that range.

    I hope the ticket I opened about this hasn’t been dropped because it was assumed that the above problem was the cause of it, because it’s still bad.

    – Tim

  5. Tim:

    If a trouble ticket is opened for a customer, even during a widespread issue like this one, our support staff will always follow up after the fact to find out if your problem was resolved as part of the widespread issue or not. I just looked into your ticket and it looks like you and our support staff are continuing to work your issue.

    -Jared

  6. I had called in because my speed had dropped from 2560 kbps (3 Mbps account) to 100-400 kbps that morning. The technician said my line had no problems and he had me bypass my router and swap modems with no difference in results. I have an iPod touch that uses the wireless part of my router. That iPod has a speedtest app and the same results rule out anything wrong with the PC. I’m surprised that such a system problem cannot be picked up at the support end.

  7. Hello Gene. Unfortunately, it can take a little time to correlate problems indicating an outage. Amusingly, the technician you spoke to left this comment in his notes: “There seem to be a few performance issues happening at this time, keeping an eye out for possible outages.”

    Thank you for being one of the early reporters. It does look like your contact helped us track down the issue.

  8. Gene:

    As I mentioned above, Support is one of our best barometers of any network problems, and they were instrumental in identifying this problem and helping us isolate it to a particular piece of equipment. Our support staff correlates trends in calls to determine if there may be a widespread issue, and escalates if need be. If you are one of the first people to call in regarding an issue, you may not be told that you are part of a widespread issue, because we haven’t reached that conclusion yet, but your call is instrumental in helping us identify and thus resolve your problem.
    We do have extensive monitoring systems to help us identify and resolve problems, but no matter how good a monitoring system you have, there will always be some number of failures that it cannot detect. A core router interface silently dropping/mangling packets is a failure mode that is quite difficult to detect automatically.

  9. Thanks Jared and Kevan. That explains it. Like health problems, network problems can take a while to diagnose, depending on complexity.
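
Jared's replies above describe support staff correlating call trends to spot a widespread issue. As a rough illustration only, and not Sonic.net's actual tooling, that kind of correlation could be sketched as a sliding-window count of calls per region. The function name, the 30-minute window, the threshold of three calls, the call timestamps, and the "OTHER" region label are all assumptions made up for this example.

```python
from collections import Counter
from datetime import datetime, timedelta


def regions_over_threshold(calls, now, window=timedelta(minutes=30), threshold=3):
    """Return regions with at least `threshold` calls in the last `window`.

    `calls` is a list of (timestamp, region) pairs. The window and
    threshold defaults are illustrative assumptions, not real policy.
    """
    cutoff = now - window
    counts = Counter(region for ts, region in calls if cutoff <= ts <= now)
    return sorted(region for region, n in counts.items() if n >= threshold)


# Hypothetical call log; times and regions are invented for illustration.
calls = [
    (datetime(2009, 11, 10, 9, 40), "LATA1"),
    (datetime(2009, 11, 10, 9, 50), "LATA1"),
    (datetime(2009, 11, 10, 9, 55), "OTHER"),
    (datetime(2009, 11, 10, 10, 5), "LATA1"),
]

# After only the first call, no region stands out yet.
early = regions_over_threshold(calls, datetime(2009, 11, 10, 9, 45))
# Once several calls have arrived, the affected region crosses the threshold.
later = regions_over_threshold(calls, datetime(2009, 11, 10, 10, 10))
print(early, later)
```

In this sketch the earliest caller sees no flagged region, which mirrors the point made in the thread: the first few reports have to be handled as individual problems until enough of them accumulate to show a pattern.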
