[Xymon] logfetch issue - sending old data causing false alerts

Max Xu Max.Xu at servicenow.com
Tue Jun 27 02:49:02 CEST 2017


Hi Jeremy,

The log file has only 39 lines added on the entire day of 6/19 that fired the alert with the strange behavior. On 6/20, log file has also 39 entries added only, yet another similar alert fired. A note, probably does not matter, on 6/14, there were almost 4m entries generated in 10h.

So, it does not seem the file size growing is linked to it. I am guessing it is particular state of the server causing the file saving error. We have big number of servers and this only happens on this particular one.

From: Jeremy Laidman <jlaidman at rebel-it.com.au<mailto:jlaidman at rebel-it.com.au>>
Date: Wednesday, June 21, 2017 at 6:21 PM
To: Max Xu <Max.Xu at servicenow.com<mailto:Max.Xu at servicenow.com>>
Cc: "xymon at xymon.com<mailto:xymon at xymon.com>" <xymon at xymon.com<mailto:xymon at xymon.com>>
Subject: Re: [Xymon] logfetch issue - sending old data causing false alerts

Max

It's certainly weird. This seems like some kind of integer overflow. If the log update rate high, I can imagine at some point, Xymon's calculations might cause an undetected overflow/underflow. Can you give an indication of the approximate increase in file size per 5 minutes, ideally around the time of an incidence of this fault?

You might be able to reproduce the problem by creating your own copy of the logfile and the logfetch.<hostname>.cfg and logfetch.<hostname>status files. If you can manipulate the logfile copy to have the size reported in your logfetch logs just prior to a fault (and confirm that it ends with the log lines reported in the Xymon alert), and then add the extra records to be the size at the time of the fault, you can run logfetch yourself to operate on the copies of the logfile. Once you can reproduce it in this way, we can work on troubleshooting the code.

It might be interesting to use the TRACK keyword in analysis.cfg to monitor the size of the file, and see if there's some correlation with the occurrence of the fault you're seeing.

Cheers
Jeremy


On 22 June 2017 at 10:04, Max Xu <Max.Xu at servicenow.com<mailto:Max.Xu at servicenow.com>> wrote:
Thanks Jeremy.
The log file is created and updated by oracle Db. It was rotated by at 6/14 (note this ill alert was generated at 6/19).The file shrunk is not possible. I also checked (best effort) that no one manually touch the file.

From: Jeremy Laidman <jlaidman at rebel-it.com.au<mailto:jlaidman at rebel-it.com.au>>
Date: Wednesday, June 21, 2017 at 2:05 PM
To: Max Xu <Max.Xu at servicenow.com<mailto:Max.Xu at servicenow.com>>
Cc: "xymon at xymon.com<mailto:xymon at xymon.com>" <xymon at xymon.com<mailto:xymon at xymon.com>>
Subject: Re: [Xymon] logfetch issue - sending old data causing false alerts

How is the log file being created and written to? When and how is it rotated? The logfetch output indicates that that file shrunk in size twice; is that possible?

J

On 22 Jun. 2017 05:20, "Max Xu" <Max.Xu at servicenow.com<mailto:Max.Xu at servicenow.com>> wrote:
Hi,

Logfetch is sending old data causing false alerts.

The log file looks somewhat like this:
Error    2017-06-14 11:36:58.613343 39915 2184308576 Compare server: ……
Error    2017-06-14 11:36:58.613481 39913 1581872992 Command server: ……
……  (note. The above repeat about 780K times)
Info     2017-06-14 13:07:41.113163 1193 1036199776 Compare server exited normally, pid = 45494  [sp_desvr]
…..
Error    2017-06-15 02:42:22.820068 1761 2399766368 Command server:…..
……


At 6/19 and 6/20, msgs alert generated with all the old data of 6/14 and 6/15 etc. below is sniper of alert on 6/19
Mon Jun 19 17:48:57 PDT 2017 - Log files NOT ok

[red] Critical entries in /u01/shareplex/var/log/event_log<https://monitor01.lhr9.service-now.com/xymon-cgi/svcstatus.sh?CLIENT=ora164106.sjc4.service-now.com&SECTION=msgs:/u01/shareplex/var/log/event_log>[red] Error    2017-06-14 12:07:24.545252 9795 1581102944 Command server: ReconcileLog: failed to construct object-cache: Illegal state: Item 372354 already in the object id registry  (connecting from ora164106.sjc4.service-now.com<http://ora164106.sjc4.service-now.com>) [module osp]
[red] Error    2017-06-14 12:07:24.545499 9795 1581102944 Command server: ReconcileLog: failed to construct object-cache: Illegal state: Item 372356 already in the object id registry  (connecting from ora164106.sjc4.service-now.com<http://ora164106.sjc4.service-now.com>) [module osp]



Meantime, see xymonclient.log:
2017-06-19 17:49:01.428381 logfetch: File /u01/shareplex/var/log/event_log shrank from >=173538314 to 48414720 bytes in size. Probably rotated; clearing position state
2017-06-19 17:49:01.428462 logfetch: /u01/shareplex/var/log/event_log delta 48414720 bytes exceeds max buffer size 10485760; skipping some data
2017-06-19 17:51:05.086815 logfetch: /u01/shareplex/var/log/event_log delta 173538314 bytes exceeds max buffer size 10485760; skipping some data
2017-06-19 17:53:09.134469<tel:13%2044%2069> logfetch: /u01/shareplex/var/log/event_log delta 173538314 bytes exceeds max buffer size 10485760; skipping some data
2017-06-19 17:55:12.647682 logfetch: /u01/shareplex/var/log/event_log delta 173538314 bytes exceeds max buffer size 10485760; skipping some data
2017-06-19 17:57:16.163913 logfetch: /u01/shareplex/var/log/event_log delta 173538314 bytes exceeds max buffer size 10485760; skipping some data
2017-06-19 17:59:19.662801 logfetch: /u01/shareplex/var/log/event_log delta 173538314 bytes exceeds max buffer size 10485760; skipping some data
2017-06-19 18:01:23.180499 logfetch: /u01/shareplex/var/log/event_log delta 173538453 bytes exceeds max buffer size 10485760; skipping some data
2017-06-19 18:03:26.777636 logfetch: /u01/shareplex/var/log/event_log delta 125123733 bytes exceeds max buffer size 10485760; skipping some data
2017-06-20 06:42:01.519481 logfetch: File /u01/shareplex/var/log/event_log shrank from >=173541482 to 74420224 bytes in size. Probably rotated; clearing position state
2017-06-20 06:42:01.519557 logfetch: /u01/shareplex/var/log/event_log delta 74420224 bytes exceeds max buffer size 10485760; skipping some data
2017-06-20 06:44:05.173606 logfetch: /u01/shareplex/var/log/event_log delta 173541633 bytes exceeds max buffer size 10485760; skipping some data
2017-06-20 06:46:08.670466 logfetch: /u01/shareplex/var/log/event_log delta 173541633 bytes exceeds max buffer size 10485760; skipping some data
2017-06-20 06:48:12.188216 logfetch: /u01/shareplex/var/log/event_log delta 173541633 bytes exceeds max buffer size 10485760; skipping some data
2017-06-20 06:50:15.683455 logfetch: /u01/shareplex/var/log/event_log delta 173541633 bytes exceeds max buffer size 10485760; skipping some data
2017-06-20 06:52:19.250727 logfetch: /u01/shareplex/var/log/event_log delta 173541633 bytes exceeds max buffer size 10485760; skipping some data
2017-06-20 06:54:22.752463 logfetch: /u01/shareplex/var/log/event_log delta 173541633 bytes exceeds max buffer size 10485760; skipping some data
2017-06-20 06:56:23.426678 logfetch: /u01/shareplex/var/log/event_log delta 99121409 bytes exceeds max buffer size 10485760; skipping some data

Noted.

  1.  The above 2m interval is my setup of xymon client.
  2.  It seems the logfetch status file is not successfully saved and source code shows no error check (so no direct evidence).
  3.  The behavior only last under 20 min. The server itself did not have disk and cpu alerts and no one report any issues related to disk and io.

I was told that this behavior is not new although rarely happen. Is there any solution or work round?

My running version is:
Xymon version 4.3.25-1.el6.terabithia


Thanks,
-max


_______________________________________________
Xymon mailing list
Xymon at xymon.com<mailto:Xymon at xymon.com>
http://lists.xymon.com/mailman/listinfo/xymon<https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.xymon.com_mailman_listinfo_xymon&d=DwMFaQ&c=Zok6nrOF6Fe0JtVEqKh3FEeUbToa1PtNBZf6G01cvEQ&r=_rRsvDpSvkhydGEmSgSLQ5J0DhQTISZJFuX6D4Oq8-A&m=jlInc0qH4D6d4VJhKVlklq7Q5yzJEljN-Xn1abaymNs&s=2sE4XpgdTb8MJw0cvVJQANiKNarwuYgReKpzXR9lL4k&e=>


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xymon.com/pipermail/xymon/attachments/20170627/0cc6a1ba/attachment.html>


More information about the Xymon mailing list