We are seeing an increase in DNSSEC validation failures on our recursive DNS servers. The cause has been identified as a recently applied security patch that enforces a stricter validation policy for domains with DNSSEC enabled. We are currently looking for ways to mitigate the problem.
Update: The cause of these problems has been positively identified as a behavior change in a new version of ISC’s Bind, which was released two days ago in response to a group of newly disclosed potential security vulnerabilities (CVEs). As always, we strive to deploy security fixes in our network as quickly as possible, and we rolled this new version out to all of our recursive DNS cluster backend servers over the course of the day starting Thursday morning. Specifically, the problem is the removal of SIG(0) combined with a change in how “invalid” DNSSEC keys are handled: they are now treated as validation failures instead of being skipped. We’re currently stuck between a rock and a hard place: either keep running a version with a known potential cache poisoning vulnerability, or run a version that breaks an unknown number of domains still relying on SIG(0). More updates are forthcoming; we hope to choose a path forward soon to mitigate the impact on customers.
Update: We are in the process of rolling back the affected version across our name server clusters. In our assessment, the additional complexity we believe an attacker would need for one of these potential cache poisoning attacks to succeed against our network justifies rolling back to the previous version, rather than taking the other option, which was to disable DNSSEC entirely until the issues with the new version are resolved. For additional clarity, this was originally brought to our attention by students and staff at usfca.edu, who found this morning that they were unable to resolve usfca.edu domains. We are not sure how many other domains are affected, or whether this issue can rightly be blamed on the upstream DNS server administrators. The rollback should be completed shortly, and we’re sorry for any confusion or trouble this may have caused you today. -Kelsey, Kevan and William
Update 2025-10-31: As it turns out, this problem was a combination of several issues. It was actually related to zones that contained a deprecated DNSSEC key type (RSASHA1), even if they also had a valid key. This was further complicated by RHEL’s security policy framework, which triggered the new, undesired behavior in Bind. We are investigating several workarounds for this, but we also expect Bind to release an update that corrects this behavior relatively soon.
-SOC
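To help spot zones matching the pattern in the 2025-10-31 update (a deprecated SHA-1 key published alongside a valid key), here is a minimal sketch that inspects a zone's published DNSKEY set. It assumes the input lines are in the format printed by `dig +short DNSKEY <zone>`, i.e. "<flags> <protocol> <algorithm> <key>". Algorithm numbers 5 (RSASHA1) and 7 (RSASHA1-NSEC3-SHA1) come from the IANA DNSSEC algorithm registry; treating algorithm 7 as affected too is an assumption, since the update only names RSASHA1. The function names are hypothetical, and the sketch only looks at which keys are published. It does not reproduce how Bind or RHEL’s policy framework actually decides validation.

```python
# Hypothetical helper: classify a zone's DNSKEY set by whether it mixes
# SHA-1-based algorithms with other algorithms. Input lines are assumed to
# look like `dig +short DNSKEY <zone>` output: "<flags> <protocol> <algorithm> <key>".

# SHA-1-based DNSSEC algorithm numbers (IANA registry). Including 7 is an
# assumption; the incident update only names RSASHA1 (5).
SHA1_ALGORITHMS = frozenset({5, 7})


def algorithms_in(dnskey_lines):
    """Return the set of algorithm numbers found in DNSKEY rdata lines."""
    algs = set()
    for line in dnskey_lines:
        fields = line.split()
        # Skip blank or malformed lines and non-numeric algorithm fields.
        if len(fields) < 4 or not fields[2].isdigit():
            continue
        algs.add(int(fields[2]))
    return algs


def classify_zone(dnskey_lines):
    """Return 'mixed', 'sha1-only', 'no-sha1', or 'no-dnskey'."""
    algs = algorithms_in(dnskey_lines)
    sha1 = algs & SHA1_ALGORITHMS
    other = algs - SHA1_ALGORITHMS
    if sha1 and other:
        return "mixed"  # deprecated SHA-1 key alongside a non-SHA-1 key
    if sha1:
        return "sha1-only"
    if other:
        return "no-sha1"
    return "no-dnskey"
```

A zone reported as "mixed" is the case described above: it has a usable key but still publishes an RSASHA1 key.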