[Xymon] Always purple history after time shift on server - how to fix

Andrey Chervonets A.Chervonets at cominder.eu
Thu Mar 10 10:44:32 CET 2016


I would like to share some hints on resolving a history reporting problem
that appears after a big time shift (about 4 hours) on the monitoring server.
Maybe it will help someone else.

It happened some months ago, but I found time to fix it only today.
What happened:
1. The time on the monitoring host jumped forward by 4 hours.
2. As a result, all metrics reported Purple status (this is intended
functionality, but it would be nice if Xymon detected a big time shift and
adapted its reporting in some way).
3. It was a problem at the virtual host provider; I reported the problem and
the time was set back to the correct value.
4. To fix current reporting I cleaned some files under xymon/logs or
acks (I really do not remember which ones right now). This reset the last
status duration information, but current values for all metrics became
correct.
5. Everything became OK, except that when I checked the history for a metric
(...xymon-cgi/history.sh? ...), for some metrics Xymon always reported
Purple for the last event (since the time of the incident).


This affected only some metrics (not all). I also had a second monitoring
server with the same information (which did not have the time shift
incident), so I was able to live with it for some months.

Solution: 
Today I have fixed that reporting problem with the following steps, which 
should be executed for every host-metric pair having the problem

Two files are involved:
1) the host history file, like
 hist/HOSTNAME
# here we should find records with a negative duration value (4th field), like:
svcs 1435410898 1435426055 -15157 gr pu 1
who 1435410899 1435426055 -15156 gr pu 1
msgs 1435410899 1435426055 -15156 gr pu 1
netstat 1435410899 1435426055 -15156 gr pu 1
memory 1435411034 1435426055 -15021 ye pu 2
uptime 1435411140 1435426055 -14915 gr pu 1
procs 1435411145 1435426055 -14910 gr pu 1
disk 1435411150 1435426055 -14905 ye pu 2
cpu 1435411222 1435426055 -14833 gr pu 1

# and drop them

2) the service history file (svc is the service name), like
 hist/HOSTNAME.svc
# again, find records with a negative duration value (last field), like:
Sat Jun 27 20:27:35 2015 purple 1435426055 -15157

# and drop the record(s); there should really be just one
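
For the host history file (file 1 above), the sample records suggest that
the fourth field is simply the second field minus the third, for example
1435410898 - 1435426055 = -15157, so a negative value means the record ends
before it starts. A minimal sketch for listing such records and checking
that relation. The function name list_negative_host is made up here, and the
field meanings are inferred from the sample records only, not from Xymon
documentation:

```shell
# list_negative_host FILE
# Print service name and duration for every record of a host history file
# whose 4th field is a negative integer.
# Assumption (from the sample records only): field 4 is the duration and
# equals field 2 minus field 3; "consistent" means that relation holds.
list_negative_host() {
    awk '$4 ~ /^-[0-9]+$/ {
        print $1, $4, (($2 - $3 == $4) ? "consistent" : "check")
    }' "$1"
}
```

Called as list_negative_host hist/HOSTNAME, it only reads the file and
changes nothing, so it is safe to run before deciding what to delete.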


Actually, to fix the reporting for just one service it is enough to drop the
negative duration records from the service history file only (tested).
But I do not see any reason to keep such records in the host history file,
so I delete them from that file too.
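
A minimal sketch of that removal for a single file, keeping a copy of the
original in a separate directory. The function names drop_negative and
clean_in_place and the backup directory are my own choices for illustration.
The field numbers (4 for hist/HOSTNAME, 8 for hist/HOSTNAME.svc) are taken
from the sample records above:

```shell
# drop_negative FIELD FILE
# Write FILE to stdout without the records whose FIELD-th column is a
# negative integer. Records with fewer fields are kept unchanged.
drop_negative() {
    awk -v f="$1" '$f !~ /^-[0-9]+$/' "$2"
}

# clean_in_place FIELD FILE BACKUPDIR
# Copy FILE to BACKUPDIR first, then rewrite FILE from that copy without
# the negative duration records. BACKUPDIR should be outside hist/ so the
# copy is not mistaken for a history file.
clean_in_place() {
    field=$1 file=$2 backup_dir=$3
    mkdir -p "$backup_dir" &&
    cp -p "$file" "$backup_dir/" &&
    drop_negative "$field" "$backup_dir/$(basename "$file")" > "$file"
}
```

For example, clean_in_place 8 hist/HOSTNAME.svc /tmp/hist-backup for the
service history file and clean_in_place 4 hist/HOSTNAME /tmp/hist-backup for
the host history file. Rewriting with > keeps the owner and permissions of
the existing file.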

How to automate the process:
# step 1: find host history files with negative duration records (field 4)
find hist/ -name "*.*" -print0 | xargs -0 grep " -" | awk '{print $1" :"$4}' | grep ":-"

#output like:
...
hist/idc-oracle03.msc-sh.local:ssh :-14862
hist/idc-oracle03.msc-sh.local:dblock :-15012
hist/idc-oracle03.msc-sh.local:dbrec :-15012
hist/idc-oracle03.msc-sh.local:dbup :-15011
hist/idc-oracle03.msc-sh.local:dbext :-14989
...

# step 2: find service history files with negative duration records (field 8)
find hist/ -name "*.*" -print0 | xargs -0 grep " -" | awk '{print $1" :"$8}' | grep ":-"
# output like:
..
hist/idc-oracle03,domain.com.dbrec:Sat :-15012
hist/gdc-oracle03,domain.com.dbup:Sat :-15136
hist/idc-oracle01,domain.com.disk:Sat :-14961
hist/gdc-oracle01,domain.com.dbaud:Thu :-26793
hist/gdc-oracle01,domain.com.dbaud:Sat :-14940
..

The removal of these records can then be automated too.
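
One possible way to script the removal for the whole directory is sketched
below. It is not something I ran in the original fix; the function name
clean_hist_dir and the backup directory are made up, and it assumes, based
only on the sample records, that host history records have 7 fields with the
duration in field 4 and service history records have 8 fields with the
duration in field 8:

```shell
# clean_hist_dir HISTDIR BACKUPDIR
# For every regular file under HISTDIR that contains a negative number,
# copy it to BACKUPDIR and rewrite it without the negative duration records.
# BACKUPDIR must be outside HISTDIR, otherwise find would pick up the copies.
clean_hist_dir() {
    hist_dir=$1 backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    # grep -l prints only the names of the files that contain a match
    find "$hist_dir" -type f -exec grep -l -E ' -[0-9]+( |$)' {} + |
    while IFS= read -r file; do
        cp -p "$file" "$backup_dir/" &&
        awk '!((NF == 7 && $4 ~ /^-[0-9]+$/) ||
               (NF == 8 && $8 ~ /^-[0-9]+$/))' \
            "$backup_dir/$(basename "$file")" > "$file" &&
        echo "cleaned: $file"
    done
}
```

For example, clean_hist_dir hist /tmp/hist-backup. Files without negative
values are never touched, and every rewritten file has its original copy in
the backup directory, so a wrong guess about the record layout can be undone.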


Best regards,

Andrey Chervonets
----------------------
SIA CoMinder
http://www.cominder.eu/

