Had an interesting issue today. One of the production systems suddenly went dark, and we found out about it from the client. This is never a good way to start a Thursday. It turns out that the client was having DNS issues and the domain was no longer valid. Relatively simple fix, crisis averted…
But why didn’t the monitoring system pick it up?
We use Dotcom-Monitor to check each of our sites on a regular basis. The monitor actually logs in to each website to verify functionality. What in the DNS world could cause this issue in such a scenario? How about a caching nameserver? Turns out, to limit the stress on their nameserver, Dotcom Monitor set up a standard caching nameserver that keeps a record in cache until the record expires. So even though DNS was no longer working for this site, the monitor thought everything was A-OK.
What can we do to fix this issue? Not much unfortunately. Dotcom Monitor will have to implement a change in their infrastructure which will likely increase the load on their DNS servers significantly. Since that’s not likely, it looks like I’ll have to build a service into our internal monitor (Zabbix based) to check for the domain against the SOA for it.