[Xymon] XYMON Proxy Issue
Andy Smith
abs at shadymint.com
Tue May 20 08:28:46 CEST 2014
Andy Smith wrote:
> Andy Smith wrote:
>> Hi,
>>
>> In February, Gautier reported this issue with xymonproxy on Solaris :-
>>
>> http://lists.xymon.com/pipermail/xymon/2014-February/039160.html
>>
>> I have come this week to update an installation of 4.2.3 on Solaris 9
>> and have encountered the exact same issue as Gautier, but this time on
>> the latest 4.3.17 code :-
>>
>> 2014-05-04 13:05:36 xymonproxy version 4.3.17 starting
>> 2014-05-04 13:20:41 Listening on 0.0.0.0:1984
>> 2014-05-04 13:20:41 Sending to Xymon server(s) xx.xx.xx.xx:1984
>> 2014-05-04 13:20:41 select() failed: Invalid argument
>> 2014-05-04 13:20:41 select() failed: Invalid argument
>> 2014-05-04 13:20:41 select() failed: Invalid argument
>> 2014-05-04 13:20:41 select() failed: Invalid argument
>> 2014-05-04 13:20:41 select() failed: Invalid argument
>> 2014-05-04 13:20:41 select() failed: Invalid argument
>> 2014-05-04 13:20:41 Too many select failures, aborting
>> 2014-05-04 13:20:46 xymonproxy version 4.3.17 starting
>>
>> I do not suffer the connections in TIME_WAIT, just the constant
>> restarting of the proxy every 15 minutes. Here is the truss as it
>> gasps when falling over :-
>>
>> poll(0xFFBFF208, 1, 1000) = 0
>> time() = 1399206937
>> poll(0xFFBFF208, 1, 1000) = 0
>> time() = 1399206938
>> poll(0xFFBFF208, 1, 1000) = 0
>> time() = 1399206939
>> poll(0xFFBFF208, 1, 1000) = 0
>> time() = 1399206940
>> poll(0xFFBFF208, 1, 1000) = 0
>> time() = 1399206941
>> poll(0xFFBFF208, 1, 1000) = 0
>> time() = 1399206942
>> poll(0xFFBFF208, 1, 1000) = 1
>> accept(3, 0x0003AC60, 0xFFBFF310, 1) = 4
>> fcntl(4, F_SETFL, 0x00000080) = 0
>> time() = 1399206942
>> poll(0xFFBFF200, 2, 1000) = 1
>> read(4, " s t a t u s + 4 5 c s".., 8185) = 140
>> time() = 1399206942
>> poll(0xFFBFF200, 2, 1000) = 1
>> read(4, 0x00038CE2, 8045) = 0
>> time() = 1399206942
>> shutdown(4, 2, 1) = 0
>> close(4) = 0
>> poll(0xFFBFF208, 1, 1000) = 1
>> accept(3, 0x0003ACD0, 0xFFBFF310, 1) = 4
>> fcntl(4, F_SETFL, 0x00000080) = 0
>> time() = 1399206942
>> time() = 1399206942
>> write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
>> write(2, " ", 1) = 1
>> write(2, " s e l e c t ( ) f a i".., 34) = 34
>> time() = 1399206942
>> time() = 1399206942
>> write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
>> write(2, " ", 1) = 1
>> write(2, " s e l e c t ( ) f a i".., 34) = 34
>> time() = 1399206942
>> time() = 1399206942
>> write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
>> write(2, " ", 1) = 1
>> write(2, " s e l e c t ( ) f a i".., 34) = 34
>> time() = 1399206942
>> time() = 1399206942
>> write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
>> write(2, " ", 1) = 1
>> write(2, " s e l e c t ( ) f a i".., 34) = 34
>> time() = 1399206942
>> time() = 1399206942
>> write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
>> write(2, " ", 1) = 1
>> write(2, " s e l e c t ( ) f a i".., 34) = 34
>> time() = 1399206942
>> time() = 1399206942
>> write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
>> write(2, " ", 1) = 1
>> write(2, " s e l e c t ( ) f a i".., 34) = 34
>> time() = 1399206942
>> write(2, " 2 0 1 4 - 0 5 - 0 4 1".., 19) = 19
>> write(2, " ", 1) = 1
>> write(2, " T o o m a n y s e l".., 35) = 35
>> _exit(1)
>>
>> So, question to Gautier, are you using Solaris 9 and have you managed
>> to resolve this?
>>
>> Another question for the rest of the list: this is the only proxy I
>> have on Solaris; all the others are on Red Hat. Is anyone else using
>> xymonproxy on Solaris and, if so, which version? For the time being I
>> am running the old bbproxy until I get this fixed. The rest of 4.3.17
>> seems to be working OK.
>
> Done a bit more digging around. Firstly, if I regress to r#7368
> (4.3.13) then xymonproxy on Solaris is stable. This just hides the
> problem of course and might be a factor in Gautier's performance issue.
>
> If I modify the code for 4.3.17 to remove the exit after 5 select()
> failures and add in some further debugging, I can observe that on
> Solaris 9 at least :-
>
> - every 900 seconds, select() fails
> - select continues to fail for 2 seconds then succeeds and the proxy
> continues as normal.
> - during these 2 seconds, there are no further calls to poll(), but
> somewhere in the region of 50,000 calls to time().
> - the values for the selecttmo structure and maxfd are reasonable, so
> the invalid argument must be one of the fdread or fdwrite structures.
>
> Continuing to collect information but still not sure if I am looking at
> a Sol9 issue or if this affects later Solaris versions.
This issue affected Solaris 10 as well. The attached patch resolves all
my xymonproxy stability problems on Solaris platforms. I believe the
patch is relevant to other platforms too; it is just that select() on
those platforms is more tolerant.
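
For anyone who wants to repeat the extra debugging described in the quoted
message, here is a minimal sketch of a pre-flight check on the select()
arguments. It is not taken from the Xymon source or from the attached patch:
check_select_args() is a hypothetical helper, and its parameter names simply
mirror the maxfd, fdread, fdwrite and selecttmo values mentioned above. For
reference, POSIX documents EINVAL for an out-of-range nfds or an invalid
timeout, and EBADF for a descriptor in a set that is not open, so the helper
tests all three.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>

/*
 * Hypothetical debugging helper (not part of Xymon): inspect the values
 * about to be handed to select(maxfd + 1, fdread, fdwrite, NULL, tmo)
 * and report the first problem found on stderr.
 * Returns 0 if the arguments look sane, -1 otherwise.
 */
int check_select_args(int maxfd, fd_set *fdread, fd_set *fdwrite,
                      const struct timeval *tmo)
{
    int fd;

    /* nfds is maxfd + 1 and must lie in the range 0..FD_SETSIZE */
    if (maxfd < -1 || maxfd >= FD_SETSIZE) {
        fprintf(stderr, "maxfd %d out of range (FD_SETSIZE=%d)\n",
                maxfd, (int)FD_SETSIZE);
        return -1;
    }

    /* A NULL timeout is legal; otherwise both fields must be in range */
    if (tmo != NULL &&
        (tmo->tv_sec < 0 || tmo->tv_usec < 0 || tmo->tv_usec >= 1000000)) {
        fprintf(stderr, "bad timeout: tv_sec=%ld tv_usec=%ld\n",
                (long)tmo->tv_sec, (long)tmo->tv_usec);
        return -1;
    }

    /* Every descriptor present in either set must still be open */
    for (fd = 0; fd <= maxfd; fd++) {
        int inread  = (fdread  != NULL && FD_ISSET(fd, fdread));
        int inwrite = (fdwrite != NULL && FD_ISSET(fd, fdwrite));

        if ((inread || inwrite) &&
            fcntl(fd, F_GETFD) == -1 && errno == EBADF) {
            fprintf(stderr, "fd %d is in the %s set but is not open\n",
                    fd, inread ? "read" : "write");
            return -1;
        }
    }

    return 0;
}
```

Calling this immediately before select() in the proxy's main loop, and
logging when it returns -1, should show which argument is being rejected
without having to remove the abort on repeated failures.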
--
Andy
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xymon-4.3.17-patch.tar.gz
Type: application/x-gzip
Size: 332 bytes
Desc: not available
URL: <http://lists.xymon.com/pipermail/xymon/attachments/20140520/2ccf9cea/attachment.bin>