[Xymon] DNS failures causing "runtime longer than time limit"

Jeremy Laidman jlaidman at rebel-it.com.au
Mon Sep 11 05:52:46 CEST 2017


Hi

I'm reviving an old thread, because this is biting me again, so I wanted to
know if anyone had any fresh ideas on this problem.

Many of the servers I monitor are DNS servers, so the C-ARES library has a
lot of queries to perform every 5 minutes. In some cases, I want to ensure
that a DNS service is down (and alert when not) so most of the time I can
expect a timeout, leading to a long poll cycle. I'd really like to be able
to drop the timeout to significantly less than the 23 seconds it's taking
now per server.

Cheers
Jeremy


On 3 June 2015 at 13:49, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:

> OK, I'm a bit puzzled by this, and definitely pushing the envelope of my
> debugging and C coding skills.  The relevant code from xymonnet/dns.c is:
>
>     168                 tv.tv_sec = dnstimeout; tv.tv_usec = 0;
>     169                 tvp = ares_timeout(channel, &tv, &tv);
>
> I ran this through gdb, with "--dns-timeout=3" specified, setting a
> breakpoint at line 168.  I confirmed that dnstimeout is set to 3.  When I
> step one line, I should see tv.tv_sec set to 3 also, but it's set to 0.
>
> If I don't specify --dns-timeout at all, printing dnstimeout shows "30".
> Again, after stepping to the next line, tv.tv_sec is still zero.
>
> Breakpoint 1, dns_ares_queue_run (channel=0x58b1c0) at dns.c:168
> 168                     tv.tv_sec = dnstimeout; tv.tv_usec = 0;
> (gdb) p dnstimeout
> $14 = 30
> (gdb) n
> 169                     tvp = ares_timeout(channel, &tv, &tv);
> (gdb) p tv
> $15 = {tv_sec = 0, tv_usec = 0}
> (gdb)
>
> So what gives here?
>
> J
>
>
> On 3 June 2015 at 13:08, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:
>
>> Hi
>>
>> I'm running Xymon v4.3.10 on Linux, and I'm quite sure it's compiled with
>> c-ares support.
>>
>> I have 12 new DNS servers that were added to Xymon about one month ago.
>> All of my server entries in hosts.cfg have "testip".  The tasks.cfg runs
>> xymonnet with "--dns-timeout=3".  The hosts entries look like so:
>>
>> 10.10.10.1 dnshost1.example.com    # testip dns=NS:example.com,SOA:example.com
>>
>> About a week ago, connectivity to all of these servers failed, and at the
>> same time, the xymonnet run time jumped from less than 15 seconds to about
>> 330 seconds, so about 315 seconds extra.  The xymonnet page says 295
>> seconds is taken up by DNS tests.
>>
>> If the increase in time taken is about 315 and is entirely due to the 12
>> servers failing, then each failed server is adding about 26 seconds to the
>> total run time.
>>
>> I don't think this should be happening.  With two DNS checks per server
>> and a 3-second timeout, each failed server should take about 6 seconds to
>> time out, not 26.  If I run xymonnet with "--timing --no-update" and
>> specify only one hostname, I can view the results and the timing.  This
>> shows that the ping check gets reported after about 3 seconds, and then
>> the DNS tests are executed and take 26 seconds total.
>>
>> My naive assumption was that when a server failed a ping (and didn't
>> have "noclear" defined in hosts.cfg) then the network checks would be
>> skipped.  On re-reading the man page for hosts.cfg, it dawned on me that a
>> failed ping simply suppresses failed test /results/, but doesn't stop the
>> tests from being run.
>>
>> So the real problem is that the "--dns-timeout=3" is not being taken into
>> consideration by xymonnet.  If I run xymonnet with "--debug" it tells me:
>>
>> 1900 2015-06-03 12:02:20 ares_search: tlookup='example.com', class=1, type=2
>> 1900 2015-06-03 12:02:20 ares_search: tlookup='example.com', class=1, type=6
>> 1900 2015-06-03 12:02:20 Processing 0 DNS lookups with ARES
>> 1900 2015-06-03 12:02:46 Finished ARES queue after loop 423
>>
>> This is peculiar.  Why would it say "processing 0 DNS lookups" when there
>> are two lookups to test?  Could this be because xymonnet hasn't actually
>> been built with ARES support and I didn't know it?  Is there a good way to
>> tell?  If I add "--no-ares" I get the same results perhaps suggesting a
>> lack of ARES support.  On the other hand, if I add "timeout:3" and
>> "attempts:1" into resolv.conf, I also get the same results.  If I run "nm
>> /path/to/xymonnet | grep gethostby" it returns "ares_gethostbyname".
>>
>> Just for fun, I compiled Xymon v4.3.21 and ran the xymonnet binary from
>> there, with no change in behaviour.  I also tried removing the
>> "--dns-timeout" option so that it defaults to 30 seconds, but still no
>> change - 26 seconds for two DNS tests.
>>
>> So, I'm not really sure what the problem is, but xymonnet certainly isn't
>> behaving as I would expect.
>>
>> Cheers
>> Jeremy
>>
>>
>