[Xymon] DNS failures causing "runtime longer than time limit"
Henrik Størner
henrik at hswn.dk
Mon Sep 11 07:11:56 CEST 2017
Hi,
I remember looking into this a long time ago, and the --dnstimeout
setting does not quite work as expected - because C-ARES does not quite
work as expected.
C-ARES has some timeout settings for queries, but it performs an
exponential back-off between queries, so it is impossible to really hit
the exact timeout you specify in --dnstimeout.
In fact, current 4.3.x versions have a hard-coded setting for the C-ARES
timeouts - it starts with a 2 second timeout and performs 4 attempts,
which ends up with approximately 23 second timeout for all DNS queries.
This is in xymonnet/dns.c (look for "ARES timeout"). If you need those
really short timeouts, then that is probably what you should change.
Regards,
Henrik
On 11-09-2017 05:52, Jeremy Laidman wrote:
> Hi
>
> I'm reviving an old thread, because this is biting me again, so I
> wanted to know if anyone had any fresh ideas on this problem.
>
> Many of the servers I monitor are DNS servers, so the C-ARES library
> has a lot of queries to perform every 5 minutes. In some cases, I want
> to ensure that a DNS service is down (and alert when not) so most of
> the time I can expect a timeout, leading to a long poll cycle. I'd
> really like to be able to drop the timeout to significantly less than
> the 23 seconds it's taking now per server.
>
> Cheers
> Jeremy
>
>
> On 3 June 2015 at 13:49, Jeremy Laidman <jlaidman at rebel-it.com.au
> <mailto:jlaidman at rebel-it.com.au>> wrote:
>
> OK, I'm a bit puzzled by this, and definitely pushing the envelope
> of my debugging and C coding skills. The relevant code from
> xymonnet/dns.c is:
>
> 168 tv.tv_sec = dnstimeout; tv.tv_usec = 0;
> 169 tvp = ares_timeout(channel, &tv, &tv);
>
> I ran this through gdb, with "--dns-timeout=3" specified, setting
> a breakpoint at line 168. I confirmed that dnstimeout is set to
> 3. When I step one line, I should see tv.tv_sec set to 3 also,
> but it's set to 0.
>
> If I don't specify --dns-timeout at all, printing dnstimeout shows
> "30". Again, after stepping to the next line, tv.tv_sec is still
> zero.
>
> Breakpoint 1, dns_ares_queue_run (channel=0x58b1c0) at dns.c:168
> 168 tv.tv_sec = dnstimeout; tv.tv_usec = 0;
> (gdb) p dnstimeout
> $14 = 30
> (gdb) n
> 169 tvp = ares_timeout(channel, &tv, &tv);
> (gdb) p tv
> $15 = {tv_sec = 0, tv_usec = 0}
> (gdb)
>
> So what gives here?
>
> J
>
>
> On 3 June 2015 at 13:08, Jeremy Laidman <jlaidman at rebel-it.com.au
> <mailto:jlaidman at rebel-it.com.au>> wrote:
>
> Hi
>
> I'm running Xymon v4.3.10 on Linux, and I'm quite sure it's
> compiled with c-ares support.
>
> I have 12 new DNS servers that were added to Xymon about one
> month ago. All of my server entries in hosts.cfg have
> "testip". The tasks.cfg runs xymonet with "--dns-timeout=3".
> The hosts entries look like so:
>
> 10.10.10.1 dnshost1.example.com <http://dnshost1.example.com>
> # testip dns=NS:example.com
> <http://example.com>,SOA:example.com <http://example.com>
>
> About a week ago, connectivity to all of these servers failed,
> and at the same time, the xymonnet run time jumped from less
> than 15 seconds to about 330 seconds, so about 315 seconds
> extra. The xymonnet page says 295 seconds is taken up by DNS
> tests.
>
> If the increase in time taken is about 315 and is entirely due
> to the 12 servers failing, then each failed server is adding
> about 26 seconds to the total run time.
>
> I don't think this should be happening like this. With two
> DNS checks per server, the DNS checks should be taking 6
> seconds each to time-out, not 26. If I run xymonnet with
> "--timing --no-update" and specify only one hostname, I can
> view the results and the timing. This shows that the ping
> check gets reported after about 3 seconds, and then the DNS
> tests are executed and take 26 seconds total.
>
> My naiive assumption was that when a server failed a ping (and
> didn't have "noclear" defined in hosts.cfg) then the network
> checks would be skipped. On re-reading the man page for
> hosts.cfg, it dawned on me that a failed ping simply
> suppresses failed test /results/, but doesn't stop the tests
> from being run.
>
> So the real problem is that the "--dns-timeout=3" is not being
> taken into consideration by xymonnet. If I run xymonnet with
> "--debug" it tells me:
>
> 1900 2015-06-03 12:02:20 ares_search: tlookup='example.com
> <http://example.com>', class=1, type=2
> 1900 2015-06-03 12:02:20 ares_search: tlookup='example.com
> <http://example.com>', class=1, type=6
> 1900 2015-06-03 12:02:20 Processing 0 DNS lookups with ARES
> 1900 2015-06-03 12:02:46 Finished ARES queue after loop 423
>
> This is peculiar. Why would it say "processing 0 DNS lookups"
> when there are two lookups to test? Could this be because
> xymonnet hasn't actually been built with ARES support and I
> didn't know it? Is there a good way to tell? If I add
> "--no-ares" I get the same results perhaps suggesting a lack
> of ARES support. On the other hand, if I add "timeout:3" and
> "attempts:1" into resolv.conf, I also get the same results. If
> I run "nm /path/to/xymonnet | grep gethostby" it returns
> "ares_gethostbyname".
>
> Just for fun, I compiled Xymon v4.3.21 and ran the xymonnet
> binary from there, with no change in behaviour. I also tried
> removing the "--dns-timeout" option so that it defaults to 30
> seconds, but still no change - 26 seconds for two DNS tests.
>
> So, I'm not really sure what the problem is, but xymonnet
> certainly isn't behaving as I would expect.
>
> Cheers
> Jeremy
>
>
>
>
>
> _______________________________________________
> Xymon mailing list
> Xymon at xymon.com
> http://lists.xymon.com/mailman/listinfo/xymon
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xymon.com/pipermail/xymon/attachments/20170911/67010428/attachment.html>
More information about the Xymon
mailing list