[Xymon] logfetch issue - sending old data causing false alerts

Jeremy Laidman jlaidman at rebel-it.com.au
Thu Jun 22 03:21:38 CEST 2017


Max

It's certainly weird. This seems like some kind of integer overflow. If the
log update rate high, I can imagine at some point, Xymon's calculations
might cause an undetected overflow/underflow. Can you give an indication of
the approximate increase in file size per 5 minutes, ideally around the
time of an incidence of this fault?

You might be able to reproduce the problem by creating your own copy of the
logfile and the logfetch.<hostname>.cfg and logfetch.<hostname>status
files. If you can manipulate the logfile copy to have the size reported in
your logfetch logs just prior to a fault (and confirm that it ends with the
log lines reported in the Xymon alert), and then add the extra records to
be the size at the time of the fault, you can run logfetch yourself to
operate on the copies of the logfile. Once you can reproduce it in this
way, we can work on troubleshooting the code.

It might be interesting to use the TRACK keyword in analysis.cfg to monitor
the size of the file, and see if there's some correlation with the
occurrence of the fault you're seeing.

Cheers
Jeremy


On 22 June 2017 at 10:04, Max Xu <Max.Xu at servicenow.com> wrote:

> Thanks Jeremy.
> The log file is created and updated by oracle Db. It was rotated by at
> 6/14 (note this ill alert was generated at 6/19).The file shrunk is not
> possible. I also checked (best effort) that no one manually touch the file.
>
> From: Jeremy Laidman <jlaidman at rebel-it.com.au>
> Date: Wednesday, June 21, 2017 at 2:05 PM
> To: Max Xu <Max.Xu at servicenow.com>
> Cc: "xymon at xymon.com" <xymon at xymon.com>
> Subject: Re: [Xymon] logfetch issue - sending old data causing false
> alerts
>
> How is the log file being created and written to? When and how is it
> rotated? The logfetch output indicates that that file shrunk in size twice;
> is that possible?
>
> J
>
> On 22 Jun. 2017 05:20, "Max Xu" <Max.Xu at servicenow.com> wrote:
>
>> Hi,
>>
>> Logfetch is sending old data causing false alerts.
>>
>> The log file looks somewhat like this:
>> Error    2017-06-14 11:36:58.613343 39915 2184308576 Compare server: ……
>> Error    2017-06-14 11:36:58.613481 39913 1581872992 Command server: ……
>> ……  (note. The above repeat about 780K times)
>> Info     2017-06-14 13:07:41.113163 1193 1036199776 Compare server exited
>> normally, pid = 45494  [sp_desvr]
>> …..
>> Error    2017-06-15 02:42:22.820068 1761 2399766368 Command server:…..
>> ……
>>
>>
>> At 6/19 and 6/20, msgs alert generated with all the old data of 6/14 and
>> 6/15 etc. below is sniper of alert on 6/19
>> Mon Jun 19 17:48:57 PDT 2017 - Log files NOT ok
>>
>> [image: red] Critical entries in /u01/shareplex/var/log/event_log <https://monitor01.lhr9.service-now.com/xymon-cgi/svcstatus.sh?CLIENT=ora164106.sjc4.service-now.com&SECTION=msgs:/u01/shareplex/var/log/event_log>[image: red] Error    2017-06-14 12:07:24.545252 9795 1581102944 Command server: ReconcileLog: failed to construct object-cache: Illegal state: Item 372354 already in the object id registry  (connecting from ora164106.sjc4.service-now.com) [module osp][image: red] Error    2017-06-14 12:07:24.545499 9795 1581102944 Command server: ReconcileLog: failed to construct object-cache: Illegal state: Item 372356 already in the object id registry  (connecting from ora164106.sjc4.service-now.com) [module osp]
>>
>>
>>
>> Meantime, see xymonclient.log:
>> 2017-06-19 17:49:01.428381 logfetch: File /u01/shareplex/var/log/event_log
>> shrank from >=173538314 to 48414720 bytes in size. Probably rotated;
>> clearing position state
>> 2017-06-19 17:49:01.428462 logfetch: /u01/shareplex/var/log/event_log
>> delta 48414720 bytes exceeds max buffer size 10485760; skipping some data
>> 2017-06-19 17:51:05.086815 logfetch: /u01/shareplex/var/log/event_log
>> delta 173538314 bytes exceeds max buffer size 10485760; skipping some data
>> 2017-06-19 17:53:09.134469 <13%2044%2069> logfetch:
>> /u01/shareplex/var/log/event_log delta 173538314 bytes exceeds max
>> buffer size 10485760; skipping some data
>> 2017-06-19 17:55:12.647682 logfetch: /u01/shareplex/var/log/event_log
>> delta 173538314 bytes exceeds max buffer size 10485760; skipping some data
>> 2017-06-19 17:57:16.163913 logfetch: /u01/shareplex/var/log/event_log
>> delta 173538314 bytes exceeds max buffer size 10485760; skipping some data
>> 2017-06-19 17:59:19.662801 logfetch: /u01/shareplex/var/log/event_log
>> delta 173538314 bytes exceeds max buffer size 10485760; skipping some data
>> 2017-06-19 18:01:23.180499 logfetch: /u01/shareplex/var/log/event_log
>> delta 173538453 bytes exceeds max buffer size 10485760; skipping some data
>> 2017-06-19 18:03:26.777636 logfetch: /u01/shareplex/var/log/event_log
>> delta 125123733 bytes exceeds max buffer size 10485760; skipping some data
>> 2017-06-20 06:42:01.519481 logfetch: File /u01/shareplex/var/log/event_log
>> shrank from >=173541482 to 74420224 bytes in size. Probably rotated;
>> clearing position state
>> 2017-06-20 06:42:01.519557 logfetch: /u01/shareplex/var/log/event_log
>> delta 74420224 bytes exceeds max buffer size 10485760; skipping some data
>> 2017-06-20 06:44:05.173606 logfetch: /u01/shareplex/var/log/event_log
>> delta 173541633 bytes exceeds max buffer size 10485760; skipping some data
>> 2017-06-20 06:46:08.670466 logfetch: /u01/shareplex/var/log/event_log
>> delta 173541633 bytes exceeds max buffer size 10485760; skipping some data
>> 2017-06-20 06:48:12.188216 logfetch: /u01/shareplex/var/log/event_log
>> delta 173541633 bytes exceeds max buffer size 10485760; skipping some data
>> 2017-06-20 06:50:15.683455 logfetch: /u01/shareplex/var/log/event_log
>> delta 173541633 bytes exceeds max buffer size 10485760; skipping some data
>> 2017-06-20 06:52:19.250727 logfetch: /u01/shareplex/var/log/event_log
>> delta 173541633 bytes exceeds max buffer size 10485760; skipping some data
>> 2017-06-20 06:54:22.752463 logfetch: /u01/shareplex/var/log/event_log
>> delta 173541633 bytes exceeds max buffer size 10485760; skipping some data
>> 2017-06-20 06:56:23.426678 logfetch: /u01/shareplex/var/log/event_log
>> delta 99121409 bytes exceeds max buffer size 10485760; skipping some data
>>
>> Noted.
>>
>>    1. The above 2m interval is my setup of xymon client.
>>    2. It seems the logfetch status file is not successfully saved and
>>    source code shows no error check (so no direct evidence).
>>    3. The behavior only last under 20 min. The server itself did not
>>    have disk and cpu alerts and no one report any issues related to disk and
>>    io.
>>
>>
>> I was told that this behavior is not new although rarely happen. Is there
>> any solution or work round?
>>
>> My running version is:
>> Xymon version 4.3.25-1.el6.terabithia
>>
>>
>> Thanks,
>> -max
>>
>>
>> _______________________________________________
>> Xymon mailing list
>> Xymon at xymon.com
>> http://lists.xymon.com/mailman/listinfo/xymon
>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.xymon.com_mailman_listinfo_xymon&d=DwMFaQ&c=Zok6nrOF6Fe0JtVEqKh3FEeUbToa1PtNBZf6G01cvEQ&r=_rRsvDpSvkhydGEmSgSLQ5J0DhQTISZJFuX6D4Oq8-A&m=jlInc0qH4D6d4VJhKVlklq7Q5yzJEljN-Xn1abaymNs&s=2sE4XpgdTb8MJw0cvVJQANiKNarwuYgReKpzXR9lL4k&e=>
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xymon.com/pipermail/xymon/attachments/20170622/7d840a74/attachment.html>


More information about the Xymon mailing list