[Xymon] Xymon "port" check intermittent failures for ssh TCP port 22 state=LISTEN

Jeremy Laidman jlaidman at rebel-it.com.au
Wed Jul 12 08:51:42 CEST 2017


Just giving a follow-up for those interested or affected by this.

I believe I'm closer to understanding this problem. I've set up two "while"
loops on the server, one that runs "netstat -nl | grep :22" every second,
and the other that runs "ss -ln|grep :22" every second. In the former case,
I get output most of the time, but I get no output about 3-6 times every
couple of hours. In the latter case, I always get the expected output. This
suggests to me that netstat is not doing the right thing, possibly due to a
race condition that is exacerbated under load.
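
A minimal sketch of such a loop, assuming a POSIX shell. The function names check_once and watch_port are hypothetical, not part of Xymon, and the pattern is a little stricter than a plain "grep :22": it requires whitespace after the port number so that a port such as 2200 does not count as a match. It stays silent while the listener is visible and prints a timestamped line whenever it is missing.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Xymon): report when a listing command
# shows no socket on the given local port.

# check_once PORT CMD [ARGS...]
# Runs CMD once; prints a timestamped line and returns 1 if no line of its
# output contains ":PORT" followed by whitespace.
check_once() {
    port=$1
    shift
    if "$@" 2>/dev/null | grep -q ":${port}[[:space:]]"; then
        return 0
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') no :${port} in output of: $*"
    return 1
}

# watch_port PORT CMD [ARGS...]
# Repeats check_once every second until interrupted.
watch_port() {
    while :; do
        check_once "$@"
        sleep 1
    done
}
```

Running "watch_port 22 netstat -nl" in one terminal and "watch_port 22 ss -ln" in another makes it possible to compare the timestamps from the two tools side by side.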

Ultimately, it would seem that this is not a Xymon problem at all. A Xymon fix
might be to modify xymonclient-linux.sh to use "ss" instead of "netstat",
but the output formats are different, so the parser would need to be
rewritten or enhanced. Instead, I should get netstat fixed.
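
To illustrate the format gap, here is a rough sketch of a filter that rearranges TCP lines from "ss -tan"-style output into netstat-like column order. It is not taken from xymonclient-linux.sh, and the function name ss_to_netstat is hypothetical. It assumes the ss column order State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port with a single header line, which may vary between iproute2 versions.

```shell
#!/bin/sh
# Hypothetical filter (not part of xymonclient-linux.sh): rewrite TCP lines
# from "ss -tan"-style input on stdin into netstat-like column order.
# Assumed ss column order: State Recv-Q Send-Q Local Peer.
ss_to_netstat() {
    # Skip the header line; move the state from the first column to the
    # last and prepend a protocol column.
    awk 'NR > 1 && NF >= 5 { print "tcp", $2, $3, $4, $5, $1 }'
}

# Example (assumes ss is installed):
#   ss -tan | ss_to_netstat
```

Reordering columns is only part of the job. ss abbreviates some state names (for example ESTAB where netstat prints ESTABLISHED) and may print the wildcard address differently, so any rules that match on the netstat text would still need checking.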


On 8 July 2017 at 08:32, Jeremy Laidman <jlaidman at rebel-it.com.au> wrote:

> Yes, I do the network test also. This means I could just disable 22 in the
> port test, and rely on the network test. It's an adequate work-around in
> this case. Thanks.
>
> I'd still like to know why it's a problem.
>
> J
>
>
> On 8 Jul. 2017 04:08, "Mike Burger" <mburger at bubbanfriends.org> wrote:
>
> On 2017-07-07 2:51 am, Jeremy Laidman wrote:
>
> Not much chance, really. This was my first guess at the cause. The [ports]
> section appears complete (doesn't have its own limit as far as I know), the
> [clock] section is present at the end, and the UTC: datestamp line is
> present as the last line. Hence no artefacts I would expect to see when
> truncation takes place.
>
> Also, the client messages are less than 300kB, whereas the default limit
> is 512kB and I've bumped that up to 2MB.
>
>
> On 7 July 2017 at 15:05, Ryan Novosielski <novosirj at rutgers.edu> wrote:
>
>> Any chance this is truncation happening? That test can have a lot of
>> output.
>>
>> --
>>  ____
>> || \\UTGERS,     |---------------------------*O*---------------------------
>> ||_// the State  |         Ryan Novosielski - novosirj at rutgers.edu
>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>>     `'
>>
>> On Jul 7, 2017, at 00:47, Jeremy Laidman <jlaidman at rebel-it.com.au>
>> wrote:
>>
>> Hi
>>
>> I'm getting what appear to be false-positives for the port test that is
>> monitoring the LISTEN socket for port 22, as opened by the sshd daemon. A
>> few times a month, Xymon will show that the server is not listening on port
>> 22, and 5 minutes later, the listening port is back again. The sshd process
>> has never crashed or been reconfigured (eg with SIGHUP), and no other
>> listening ports are showing the same behaviour.  The client messages for
>> the server during these events are complete and uncorrupted.
>>
>> The simplest fix is to use delayred to suppress alerts for 5 minutes.
>> However, I would like to work out what's causing this behaviour. I don't
>> believe this is a problem with Xymon at all, and instead the netstat output in
>> the client message is exactly what the OS provided the Xymon client. My
>> guess is that it's due to the way sshd works - perhaps it periodically
>> rebinds to the socket - but nothing in the sshd logs seems to correlate
>> with these events. If anyone can suggest what might be causing this, or how
>> to investigate further, I'd be grateful.
>>
>> This problem happens for about a quarter of the servers in a pool, and no
>> others. All servers are identical in OS, software and general
>> configuration, but the servers affected by this tend to be the ones taking
>> the most traffic and under the most load (although there's plenty of spare
>> CPU cycles even on the most heavily-used server). I have two Xymon servers,
>> each monitoring independently of the other, and this problem is reported by
>> both Xymon servers, although at completely different dates and times.
>>
>> Cheers
>> Jeremy
>>
>>
>>
> Have you considered adding the SSH network test, in conjunction?
>
> --
> Mike Burger
> http://www.bubbanfriends.org
>
> "It's always suicide-mission this, save-the-planet that. No one ever just
> stops by to say 'hi' anymore." --Colonel Jack O'Neill, SG1
>
>
>