
RE: [hobbit] DEVMON stops working every now and then



We have the same problem. I've even got devmon configured under SMF on Solaris; however, SMF doesn't pick up the fact that it has stopped working, as the process is still there.

A quick-and-dirty workaround we use is to send an alert when the "dm" monitor goes purple. This tells the on-call engineer that we are no longer effectively monitoring the network devices, so they can restart the process.

There must be a better way though...
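
One option might be to watch the devmon log itself, since in the quoted report below the only symptom is that the log goes silent. The following is a minimal sketch, not a tested fix. It assumes the timestamp at the start of each log line is yy-mm-dd followed by hh:mm:ss, as the excerpt below suggests, and the names `last_log_time`, `is_stalled` and the five-minute threshold are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Timestamp at the start of a devmon log line, e.g. "[09-11-10@10:54:21]".
# The " (at) " alternative covers the form shown in the list archive.
# Assumption: the date part is yy-mm-dd.
STAMP = re.compile(r"^\[(\d\d)-(\d\d)-(\d\d)(?:@| \(at\) )(\d\d):(\d\d):(\d\d)\]")


def last_log_time(lines):
    """Return the datetime of the newest timestamped line, or None."""
    for line in reversed(lines):
        m = STAMP.match(line.strip())
        if m:
            yy, mo, dd, hh, mi, ss = map(int, m.groups())
            return datetime(2000 + yy, mo, dd, hh, mi, ss)
    return None


def is_stalled(lines, now, max_silence=timedelta(minutes=5)):
    """True if no timestamped line is newer than now - max_silence."""
    last = last_log_time(lines)
    return last is None or now - last > max_silence
```

With a polling cycle of about a minute, five minutes of silence means several missed cycles. A small script run from cron could call `is_stalled(open(logfile).readlines(), datetime.now())` and, if it returns true, kill devmon so that whatever supervises it starts a fresh copy.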

---- Gregory Thomas <GThomas (at) fairdinkum.com> wrote: 
> I've got the same problem. Just had to restart after having it working for about 48 hours.
> 
> I added devmon (0.3.1-beta1) to the mix only a few weeks ago and am running it on Ubuntu (desktop 8.10) along with xymon 4.2.3 (which has been running for about 6 months). On a side note, the rrd graphing works quite well for connects, cpu, if_load, and memory.
> 
> To kill it I run "sudo killall devmon", and it goes from purple to green again without my running anything else.
> 
> To get devmon running in the first place I've added the following to hobbitlaunch.cfg. (I'm not sure this is the "proper" way to handle it, and it almost seems too easy, but devmon starts when I start xymon.)
> 
> hobbitlaunch.cfg
> ...
> [devmon]
>  CMD $BBHOME/ext/devmon/devmon
> 
> [devmonreload]
>  CMD $BBHOME/ext/devmon/devmon --readbbhosts
>  INTERVAL 5m
> ...
> I've seen others post that they have cron jobs daily or even more often to restart devmon but I wish that wasn't required.
> 
> Greg
> 
> ________________________________
> From: thorsten.erdmann (at) daimler.com [mailto:thorsten.erdmann (at) daimler.com]
> Sent: Wednesday, November 11, 2009 8:58 AM
> To: hobbit (at) hswn.dk
> Subject: [hobbit] DEVMON stops working every now and then
> 
> 
> Hello
> 
> Some time ago I mentioned that devmon stops working when a monitored device is not responding. I have now seen that it has nothing to do with unresponsive devices.
> Devmon stops working at irregular intervals. I set devmon to verbose and looked at the devmon log. When it stops working, there are simply no more messages (see below). No error messages, nothing, neither in the devmon log nor in the syslog.
> 
> If I do a "ps -ef" I see all devmon processes running:
> 
> [root@s068a300 devmon]# ps -ef |grep devmon
> hobbit   10211     1  0 Nov09 ?        00:10:07 devmon[master]
> hobbit   10214 10211  0 Nov09 ?        00:00:22 devmon
> hobbit   10215 10211  0 Nov09 ?        00:00:21 devmon
> hobbit   10217 10211  0 Nov09 ?        00:00:22 devmon
> hobbit   10218 10211  0 Nov09 ?        00:01:52 devmon
> hobbit   10219 10211  0 Nov09 ?        00:00:21 devmon
> hobbit   10220 10211  0 Nov09 ?        00:01:51 devmon
> hobbit   10221 10211  0 Nov09 ?        00:01:52 devmon
> hobbit   10222 10211  0 Nov09 ?        00:00:00 devmon
> hobbit   10223 10211  0 Nov09 ?        00:00:00 devmon
> root     20447  3611  0 14:47 pts/1    00:00:00 grep devmon
> 
> Any idea how I can find out why devmon stops working and what the processes are doing when they are stuck? If I send a SIGTERM to the devmon master process, it stops all the other processes, so it looks like it is responding to signals as it should.
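
As a starting point for looking at the stuck processes, the `ps -ef` listing above can be parsed to pick out the master PID and its workers. This is only a sketch: it assumes the column layout shown above (UID, PID, PPID, C, STIME, TTY, TIME, CMD, with STIME as a single token), and `devmon_pids` is a made-up helper name.

```python
def devmon_pids(ps_output):
    """Return (master_pid, [worker_pids]) parsed from `ps -ef` text."""
    master = None
    candidates = []
    for line in ps_output.splitlines():
        # UID PID PPID C STIME TTY TIME CMD; CMD may contain spaces
        fields = line.split(None, 7)
        if len(fields) < 8 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # header line or anything that is not a process row
        pid, ppid, cmd = int(fields[1]), int(fields[2]), fields[7].strip()
        if cmd == "devmon[master]":
            master = pid
        elif cmd == "devmon":
            candidates.append((pid, ppid))
    # Keep only the workers whose parent is the master we found
    workers = [pid for pid, ppid in candidates if ppid == master]
    return master, workers
```

On Linux, each worker PID could then be given to `strace -p <pid>` to see which system call it is blocked in, and the master PID is the one to send SIGTERM to.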
> 
> BTW: does anyone have a devmon startup/shutdown script that works on SuSE EL?
> 
> Thorsten Erdmann
> 
> Attachment:
> Here are the last few lines of the devmon log
> 
> [09-11-10@10:52:21] Performing test logic
> [09-11-10@10:52:21] Done with test logic
> [09-11-10@10:52:21] Sending messages to display server
> [09-11-10@10:52:21] Done sending messages
> [09-11-10@10:52:21] Sleeping for 59 seconds.
> [09-11-10@10:53:20] Starting snmp queries
> [09-11-10@10:53:20] Getting device status from hobbit at localhost:1984
> [09-11-10@10:53:20] Querying u068usv020a1 for tests battery,powerin,power,diag,temperature,msgs
> [09-11-10@10:53:20] Querying u068usv020a2 for tests battery,powerin,power,diag,temperature,msgs
> [09-11-10@10:53:20] Querying u068usv020b1 for tests battery,powerin,power,diag,temperature,msgs
> [09-11-10@10:53:20] Querying u068usv020b2 for tests battery,powerin,power,diag,temperature,msgs
> [09-11-10@10:53:20] Querying u068usv110111 for tests power,temperature
> [09-11-10@10:53:20] Querying u068usvnw1111 for tests power,temperature
> [09-11-10@10:53:20] Querying u068usvnw1112 for tests power,temperature
> [09-11-10@10:53:20] Querying u068usvnw1211 for tests power,temperature
> [09-11-10@10:53:21] Performing test logic
> [09-11-10@10:53:21] Done with test logic
> [09-11-10@10:53:21] Sending messages to display server
> [09-11-10@10:53:21] Done sending messages
> [09-11-10@10:53:21] Sleeping for 59 seconds.
> [09-11-10@10:54:20] Starting snmp queries
> [09-11-10@10:54:20] Getting device status from hobbit at localhost:1984
> [09-11-10@10:54:20] Querying u068usv020a1 for tests battery,powerin,power,diag,temperature,msgs
> [09-11-10@10:54:21] Querying u068usv020a2 for tests battery,powerin,power,diag,temperature,msgs
> [09-11-10@10:54:21] Querying u068usv020b1 for tests battery,powerin,power,diag,temperature,msgs
> [09-11-10@10:54:21] Querying u068usv020b2 for tests battery,powerin,power,diag,temperature,msgs
> [09-11-10@10:54:21] Querying u068usv110111 for tests power,temperature
> [09-11-10@10:54:21] Querying u068usvnw1111 for tests power,temperature
> [09-11-10@10:54:21] Querying u068usvnw1112 for tests power,temperature
> [09-11-10@10:54:21] Querying u068usvnw1211 for tests power,temperature