Early this morning four of our DHCP servers started having issues responding quickly to DCHP requests. These simultaneous failures overwhelmed our ability to migrate load to our hot-standby servers but we were able to put several work-arounds in place to mitigate the issues. These failures were initially believed to be consistent with disk pre-failure scenarios where a single disk’s performance is impacted and the RAID system has yet to fail the disk. However, upon further investigation it was revealed that these failures were triggered by scheduled SMART tests. Ironically, the SMART tests were recently enabled to help us detect and replace failing disks before their failure triggered a service impacting event.
At this time, all DHCP services have been returned to normal.
-Nathan, Don, and Kelsey