[Xymon] XYMON Proxy Issue
Andy Smith
abs at shadymint.com
Sun May 11 22:03:11 CEST 2014
Andy Smith wrote:
> Hi,
>
> In February, Gautier reported this issue with xymonproxy on Solaris :-
>
> http://lists.xymon.com/pipermail/xymon/2014-February/039160.html
>
> I have come this week to update an installation of 4.2.3 on Solaris 9
> and have encountered the exact same issue as Gautier, but this time on
> the latest 4.3.17 code :-
>
> 2014-05-04 13:05:36 xymonproxy version 4.3.17 starting
> 2014-05-04 13:20:41 Listening on 0.0.0.0:1984
> 2014-05-04 13:20:41 Sending to Xymon server(s) xx.xx.xx.xx:1984
> 2014-05-04 13:20:41 select() failed: Invalid argument
> 2014-05-04 13:20:41 select() failed: Invalid argument
> 2014-05-04 13:20:41 select() failed: Invalid argument
> 2014-05-04 13:20:41 select() failed: Invalid argument
> 2014-05-04 13:20:41 select() failed: Invalid argument
> 2014-05-04 13:20:41 select() failed: Invalid argument
> 2014-05-04 13:20:41 Too many select failures, aborting
> 2014-05-04 13:20:46 xymonproxy version 4.3.17 starting
>
> I do not suffer the connections in TIME_WAIT, just the constant
> restarting of the proxy every 15 minutes. Here is the truss as it gasps
> when falling over :-
>
> poll(0xFFBFF208, 1, 1000) = 0
> time() = 1399206937
> poll(0xFFBFF208, 1, 1000) = 0
> time() = 1399206938
> poll(0xFFBFF208, 1, 1000) = 0
> time() = 1399206939
> poll(0xFFBFF208, 1, 1000) = 0
> time() = 1399206940
> poll(0xFFBFF208, 1, 1000) = 0
> time() = 1399206941
> poll(0xFFBFF208, 1, 1000) = 0
> time() = 1399206942
> poll(0xFFBFF208, 1, 1000) = 1
> accept(3, 0x0003AC60, 0xFFBFF310, 1) = 4
> fcntl(4, F_SETFL, 0x00000080) = 0
> time() = 1399206942
> poll(0xFFBFF200, 2, 1000) = 1
> read(4, " s t a t u s + 4 5 c s".., 8185) = 140
> time() = 1399206942
> poll(0xFFBFF200, 2, 1000) = 1
> read(4, 0x00038CE2, 8045) = 0
> time() = 1399206942
> shutdown(4, 2, 1) = 0
> close(4) = 0
> poll(0xFFBFF208, 1, 1000) = 1
> accept(3, 0x0003ACD0, 0xFFBFF310, 1) = 4
> fcntl(4, F_SETFL, 0x00000080) = 0
> time() = 1399206942
> time() = 1399206942
> write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
> write(2, " ", 1) = 1
> write(2, " s e l e c t ( ) f a i".., 34) = 34
> time() = 1399206942
> time() = 1399206942
> write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
> write(2, " ", 1) = 1
> write(2, " s e l e c t ( ) f a i".., 34) = 34
> time() = 1399206942
> time() = 1399206942
> write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
> write(2, " ", 1) = 1
> write(2, " s e l e c t ( ) f a i".., 34) = 34
> time() = 1399206942
> time() = 1399206942
> write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
> write(2, " ", 1) = 1
> write(2, " s e l e c t ( ) f a i".., 34) = 34
> time() = 1399206942
> time() = 1399206942
> write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
> write(2, " ", 1) = 1
> write(2, " s e l e c t ( ) f a i".., 34) = 34
> time() = 1399206942
> time() = 1399206942
> write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
> write(2, " ", 1) = 1
> write(2, " s e l e c t ( ) f a i".., 34) = 34
> time() = 1399206942
> write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
> write(2, " ", 1) = 1
> write(2, " T o o m a n y s e l".., 35) = 35
> _exit(1)
>
> So, question to Gautier, are you using Solaris 9 and have you managed to
> resolve this?
>
> Another question to the rest of the list: this is actually the only
> proxy I have on Solaris, all the others are on Red Hat. Is anyone else
> using xymonproxy on Solaris and, if so, what version? For the time
> being, I am running the old bbproxy until I get this fixed; the rest of
> 4.3.17 seems to be working OK.

I have done a bit more digging around. Firstly, if I regress to r#7368
(4.3.13) then xymonproxy on Solaris is stable. This just hides the
problem, of course, and might be a factor in Gautier's performance issue.
If I modify the code for 4.3.17 to remove the exit after 5 select()
failures and add in some further debugging, I can observe that, on
Solaris 9 at least :-
- every 900 seconds, select() fails.
- select() continues to fail for 2 seconds, then succeeds and the proxy
  continues as normal.
- during these 2 seconds, there are no further calls to poll(), but
  somewhere in the region of 50,000 calls to time().
- the values for the selecttmo structure and maxfd are reasonable, so
  the invalid argument must be one of the fdread or fdwrite structures.
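
For anyone wanting to repeat the check on selecttmo and maxfd, something along these lines could be dropped in just before the select() call. It is only a sketch: select_arg_problem() and count_set_fds() are made-up helpers, not anything in the Xymon source, and they test only the two ranges that POSIX documents as EINVAL causes for select() (nfds and the timeout), plus a count of the descriptors in a set for the debug log.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical debugging helper (not part of xymonproxy).
 * Returns a short description of the first select() argument that is
 * outside the range POSIX allows, or NULL if nfds and the timeout
 * both look valid. A NULL timeout (block forever) is treated as valid. */
const char *select_arg_problem(int nfds, const struct timeval *tmo)
{
    if (nfds < 0 || nfds > FD_SETSIZE)
        return "nfds out of range";
    if (tmo != NULL) {
        if (tmo->tv_sec < 0)
            return "negative tv_sec";
        if (tmo->tv_usec < 0 || tmo->tv_usec >= 1000000)
            return "tv_usec out of range";
    }
    return NULL;
}

/* Hypothetical helper: count the descriptors below nfds that are set
 * in an fd_set, so the size of each set can be written to the log. */
int count_set_fds(fd_set *set, int nfds)
{
    int fd, n = 0;

    if (set == NULL)
        return 0;
    for (fd = 0; fd < nfds && fd < FD_SETSIZE; fd++) {
        if (FD_ISSET(fd, set))
            n++;
    }
    return n;
}
```

Assuming the proxy calls select() with maxfd+1, fdread, fdwrite and selecttmo (I am guessing at the exact call here), logging select_arg_problem(maxfd+1, &selecttmo) together with the two counts on every failure would show whether anything drifts out of range at the 900 second mark. If it always returns NULL, that supports looking at the descriptors in the sets rather than at the timeout or maxfd.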
I am continuing to collect information, but I am still not sure whether
this is a Solaris 9 issue or whether it also affects later Solaris versions.
--
Andy