[Xymon] dns-timeouts problem when 1 dns-server down - bbtest-net not running in parallel?

Carl Melgaard Carl.Melgaard at STAB.RM.DK
Fri May 13 14:31:44 CEST 2011


Hi,

I'm trying to figure out why Xymon adds a lot of time to the DNS tests when one of our test DNS servers is down. A run that normally takes 1 second goes to a whopping 450 seconds.

It seems there's a 30-second delay (the bbtest-net default DNS timeout), but it's repeated multiple times in series, and bbtest-net waits and waits (12 times x 30 seconds), for a total of 450 seconds to complete 455 resolved hostnames.

Here's a snapshot from stracing bbtest-net:

select(5, [4], [], NULL, {16, 789000})  = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347013, 721246198}) = 0
sendto(4, "qd\1\0\0\1\0\0\0\0\0\0\<server>"..., 36, MSG_NOSIGNAL, NULL, 0) = 36
clock_gettime(CLOCK_MONOTONIC, {347013, 721471198}) = 0
select(5, [4], [], NULL, {30, 0})       = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347043, 721050198}) = 0
clock_gettime(CLOCK_MONOTONIC, {347043, 721110198}) = 0
select(5, [4], [], NULL, {30, 0})       = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347073, 721684198}) = 0
clock_gettime(CLOCK_MONOTONIC, {347073, 721744198}) = 0
select(5, [4], [], NULL, {30, 0})       = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347103, 721863198}) = 0
clock_gettime(CLOCK_MONOTONIC, {347103, 721925198}) = 0
select(5, [4], [], NULL, {30, 0})       = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347133, 723122198}) = 0
sendto(4, "qd\1\0\0\1\0\0\0\0\0\0\<server>"..., 36, MSG_NOSIGNAL, NULL, 0) = 36
clock_gettime(CLOCK_MONOTONIC, {347133, 723301198}) = 0
select(5, [4], [], NULL, {30, 0})       = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347163, 724742198}) = 0
clock_gettime(CLOCK_MONOTONIC, {347163, 724804198}) = 0
select(5, [4], [], NULL, {30, 0})       = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347193, 725228198}) = 0
clock_gettime(CLOCK_MONOTONIC, {347193, 725289198}) = 0
select(5, [4], [], NULL, {30, 0})       = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347223, 726443198}) = 0
clock_gettime(CLOCK_MONOTONIC, {347223, 726504198}) = 0
select(5, [4], [], NULL, {30, 0})       = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347253, 728061198}) = 0
clock_gettime(CLOCK_MONOTONIC, {347253, 728120198}) = 0
select(5, [4], [], NULL, {30, 0})       = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347283, 728632198}) = 0
clock_gettime(CLOCK_MONOTONIC, {347283, 728693198}) = 0
select(5, [4], [], NULL, {30, 0})       = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347313, 728263198}) = 0
clock_gettime(CLOCK_MONOTONIC, {347313, 728323198}) = 0
select(5, [4], [], NULL, {30, 0})       = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347343, 728968198}) = 0
clock_gettime(CLOCK_MONOTONIC, {347343, 729027198}) = 0
select(5, [4], [], NULL, {29, 995000})  = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347373, 725591198}) = 0
close(4)                                = 0
clock_gettime(CLOCK_MONOTONIC, {347373, 725764198}) = 0
clock_gettime(CLOCK_MONOTONIC, {347373, 725982198}) = 0
open("/etc/resolv.conf", O_RDONLY)      = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=110, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b4f255f9000
read(4, "nameserver <ip>\nnameserver"..., 4096) = 110
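
The timestamps in the trace can be checked mechanically: the monotonic clock advances from 347013 to 347373, roughly 360 seconds, which matches twelve back-to-back 30-second select() timeouts. The following is a minimal Python sketch, not part of Xymon, that counts timed-out select() calls and measures the elapsed time between the first and last clock_gettime() call. The function name `summarize` and the `EXCERPT` variable are illustrative assumptions; the excerpt is just the first two timeouts from the trace above.

```python
import re

# Matches a select() call that returned 0 (Timeout); captures the timeval (sec, usec).
SELECT_RE = re.compile(r"select\(.*\{(\d+), (\d+)\}\)\s+= 0 \(Timeout\)")
# Matches a clock_gettime(CLOCK_MONOTONIC) call; captures the timespec (sec, nsec).
CLOCK_RE = re.compile(r"clock_gettime\(CLOCK_MONOTONIC, \{(\d+), (\d+)\}\)")

# Short excerpt copied from the trace above; paste the full strace output to analyse it all.
EXCERPT = """\
clock_gettime(CLOCK_MONOTONIC, {347013, 721471198}) = 0
select(5, [4], [], NULL, {30, 0})       = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347043, 721050198}) = 0
clock_gettime(CLOCK_MONOTONIC, {347043, 721110198}) = 0
select(5, [4], [], NULL, {30, 0})       = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC, {347073, 721684198}) = 0
"""

def summarize(trace):
    """Return (timed-out selects, sum of their timeouts in s, elapsed clock time in s)."""
    # Work in integer microseconds / nanoseconds to avoid float rounding.
    waits_us = [int(s) * 1_000_000 + int(us) for s, us in SELECT_RE.findall(trace)]
    clocks_ns = [int(s) * 1_000_000_000 + int(ns) for s, ns in CLOCK_RE.findall(trace)]
    elapsed_s = (clocks_ns[-1] - clocks_ns[0]) / 1e9 if clocks_ns else 0.0
    return len(waits_us), sum(waits_us) / 1e6, elapsed_s

count, budget_s, elapsed_s = summarize(EXCERPT)
print(f"{count} timed-out selects, {budget_s:.1f}s of timeouts, {elapsed_s:.3f}s elapsed")
```

If the elapsed time is close to the sum of the timeouts, as it is here, the waits are happening one after another rather than overlapping.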

Why so many retries? And are they actually retries? I have 5 DNS servers in resolv.conf at the moment, but that doesn't add up to 12 retries. Wouldn't it be a better idea to run these tests in parallel, so that it doesn't affect all the other DNS lookups?
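
To illustrate what "in parallel" would mean here, the following is a minimal Python sketch. It is not how bbtest-net resolves names and says nothing about its internals; it only shows that when lookups run concurrently, a few lookups that time out do not add their delays together. The names `resolve_all` and `fake_resolver` are hypothetical, and the fake resolver stands in for a dead DNS server so the example runs without network access.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor

def resolve_all(hostnames, resolver=socket.gethostbyname, workers=32):
    """Resolve hostnames concurrently; return {hostname: address or None}."""
    hostnames = list(hostnames)

    def lookup(name):
        try:
            return resolver(name)
        except OSError:
            # socket.gaierror and socket.herror are OSError subclasses.
            return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(hostnames, pool.map(lookup, hostnames)))

def fake_resolver(name):
    # Hypothetical stand-in: names starting with "slow" simulate a lookup that times out.
    if name.startswith("slow"):
        time.sleep(0.2)
        raise socket.gaierror("simulated timeout")
    return "192.0.2.1"  # documentation-only address (RFC 5737)

hosts = ["slow1", "slow2", "slow3", "fast1", "fast2"]
start = time.monotonic()
results = resolve_all(hosts, resolver=fake_resolver)
elapsed = time.monotonic() - start
print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Run one after another, the three simulated timeouts would cost about 0.6 seconds; run concurrently, the whole batch finishes in roughly the time of a single timeout.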

I've tweaked my setup with "--dns-timeout=3" and put "options timeout:1" in resolv.conf, which reduces the time to 45 seconds, but if we had 5000 servers in Xymon, we'd still be facing a major problem. Maybe it's different in 4.3.3? I'm running a weird 4.4.0-beta and will be upgrading in the near future.

Regards,

Carl Melgaard