[Xymon] DNS failures causing "runtime longer than time limit"

Henrik Størner henrik at hswn.dk
Mon Sep 11 07:11:56 CEST 2017


Hi,

I remember looking into this a long time ago, and the --dns-timeout
setting does not quite work as expected - because C-ARES does not quite
work as expected.

C-ARES has some timeout settings for queries, but it performs an
exponential back-off between queries, so it is impossible to really hit
the exact timeout you specify in --dns-timeout.

In fact, current 4.3.x versions have a hard-coded setting for the C-ARES
timeouts - it starts with a 2 second timeout and performs 4 attempts,
which ends up with a timeout of approximately 23 seconds for all DNS
queries. This is in xymonnet/dns.c (look for "ARES timeout"). If you need
those really short timeouts, then that is probably what you should change.


Regards,
Henrik


On 11-09-2017 05:52, Jeremy Laidman wrote:
> Hi
>
> I'm reviving an old thread, because this is biting me again, so I 
> wanted to know if anyone had any fresh ideas on this problem.
>
> Many of the servers I monitor are DNS servers, so the C-ARES library 
> has a lot of queries to perform every 5 minutes. In some cases, I want 
> to ensure that a DNS service is down (and alert when not) so most of 
> the time I can expect a timeout, leading to a long poll cycle. I'd 
> really like to be able to drop the timeout to significantly less than 
> the 23 seconds it's taking now per server.
>
> Cheers
> Jeremy
>
>
> On 3 June 2015 at 13:49, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
>
>     OK, I'm a bit puzzled by this, and definitely pushing the envelope
>     of my debugging and C coding skills.  The relevant code from
>     xymonnet/dns.c is:
>
>         168                 tv.tv_sec = dnstimeout; tv.tv_usec = 0;
>         169                 tvp = ares_timeout(channel, &tv, &tv);
>
>     I ran this through gdb, with "--dns-timeout=3" specified, setting
>     a breakpoint at line 168.  I confirmed that dnstimeout is set to
>     3.  When I step one line, I should see tv.tv_sec set to 3 also,
>     but it's set to 0.
>
>     If I don't specify --dns-timeout at all, printing dnstimeout shows
>     "30".  Again, after stepping to the next line, tv.tv_sec is still
>     zero.
>
>     Breakpoint 1, dns_ares_queue_run (channel=0x58b1c0) at dns.c:168
>     168                     tv.tv_sec = dnstimeout; tv.tv_usec = 0;
>     (gdb) p dnstimeout
>     $14 = 30
>     (gdb) n
>     169                     tvp = ares_timeout(channel, &tv, &tv);
>     (gdb) p tv
>     $15 = {tv_sec = 0, tv_usec = 0}
>     (gdb)
>
>     So what gives here?
>
>     J
>
>
>     On 3 June 2015 at 13:08, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
>
>         Hi
>
>         I'm running Xymon v4.3.10 on Linux, and I'm quite sure it's
>         compiled with c-ares support.
>
>         I have 12 new DNS servers that were added to Xymon about one
>         month ago.  All of my server entries in hosts.cfg have
>         "testip".  The tasks.cfg runs xymonnet with "--dns-timeout=3".
>         The hosts entries look like so:
>
>         10.10.10.1 dnshost1.example.com # testip dns=NS:example.com,SOA:example.com
>
>         About a week ago, connectivity to all of these servers failed,
>         and at the same time, the xymonnet run time jumped from less
>         than 15 seconds to about 330 seconds, so about 315 seconds
>         extra.  The xymonnet page says 295 seconds is taken up by DNS
>         tests.
>
>         If the increase in time taken is about 315 and is entirely due
>         to the 12 servers failing, then each failed server is adding
>         about 26 seconds to the total run time.
>
>         I don't think this should be happening like this.  With two
>         DNS checks per server, the DNS checks should be taking 6
>         seconds per server to time out, not 26.  If I run xymonnet with
>         "--timing --no-update" and specify only one hostname, I can
>         view the results and the timing. This shows that the ping
>         check gets reported after about 3 seconds, and then the DNS
>         tests are executed and take 26 seconds total.
>
>         My naive assumption was that when a server failed a ping (and
>         didn't have "noclear" defined in hosts.cfg) then the network
>         checks would be skipped. On re-reading the man page for
>         hosts.cfg, it dawned on me that a failed ping simply
>         suppresses failed test /results/, but doesn't stop the tests
>         from being run.
>
>         So the real problem is that the "--dns-timeout=3" is not being
>         taken into consideration by xymonnet.  If I run xymonnet with
>         "--debug" it tells me:
>
>         1900 2015-06-03 12:02:20 ares_search: tlookup='example.com', class=1, type=2
>         1900 2015-06-03 12:02:20 ares_search: tlookup='example.com', class=1, type=6
>         1900 2015-06-03 12:02:20 Processing 0 DNS lookups with ARES
>         1900 2015-06-03 12:02:46 Finished ARES queue after loop 423
>
>         This is peculiar.  Why would it say "processing 0 DNS lookups"
>         when there are two lookups to test?  Could this be because
>         xymonnet hasn't actually been built with ARES support and I
>         didn't know it?  Is there a good way to tell?  If I add
>         "--no-ares" I get the same results, perhaps suggesting a lack
>         of ARES support.  On the other hand, if I add "timeout:3" and
>         "attempts:1" into resolv.conf, I also get the same results. If
>         I run "nm /path/to/xymonnet | grep gethostby" it returns
>         "ares_gethostbyname".
>
>         Just for fun, I compiled Xymon v4.3.21 and ran the xymonnet
>         binary from there, with no change in behaviour.  I also tried
>         removing the "--dns-timeout" option so that it defaults to 30
>         seconds, but still no change - 26 seconds for two DNS tests.
>
>         So, I'm not really sure what the problem is, but xymonnet
>         certainly isn't behaving as I would expect.
>
>         Cheers
>         Jeremy
>
>
>
>
>
> _______________________________________________
> Xymon mailing list
> Xymon at xymon.com
> http://lists.xymon.com/mailman/listinfo/xymon
