
[hobbit] shared memory/semaphores after shutdown, stale alerts, other errors

To: hobbit (at) hswn.dk
Subject: [hobbit] shared memory/semaphores after shutdown, stale alerts, other errors
From: "Gore, David W (David)" <david.gore (at) verizonbusiness.com>
Date: Fri, 31 Aug 2007 13:26:30 +0000
Thread-index: Acfr0op6/OedzdKLTDOOpOrHOEYo2A==
Thread-topic: [hobbit] shared memory/semaphores after shutdown, stale alerts, other errors

Hobbit: 4.2.0 with allinone patch.
OS: Fedora Core 5, Dell Optiplex, dual core with 4G of memory

MAXMSG_STATUS=2048              # maximum size of a "status" message in kB, default: 256
MAXMSG_CLIENT=4096              # maximum size of a "client" message in kB, default: 512
MAXMSG_DATA=2048                # maximum size of a "data" message in kB, default: 256

[hobbit (at) hobbit2 etc]$ ipcs -lm

After shutting down the hobbit server:

[hobbit (at) hobbit2 ~]$ ipcs

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x0901727f 5668872    hobbit    600        131072     0

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x0901727f 7176201    hobbit    600        3

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

[hobbit (at) hobbit2 ~]$ ipcrm -s 7176201
[hobbit (at) hobbit2 ~]$ ipcrm -m 5668872
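
The manual cleanup above could be scripted. The following is a minimal sketch, assuming the ipcs output has the column layout shown above; the function name stale_ipc_commands and the SAMPLE_IPCS_OUTPUT constant are made up for illustration. It only builds the ipcrm command lines and does not run them.

```python
# Sketch only: turns pasted `ipcs` output into ipcrm command lines.
# stale_ipc_commands and SAMPLE_IPCS_OUTPUT are illustrative names.

SAMPLE_IPCS_OUTPUT = """\
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x0901727f 5668872    hobbit    600        131072     0

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x0901727f 7176201    hobbit    600        3

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
"""

# Section title in the ipcs output -> ipcrm option for that resource type.
SECTION_FLAGS = {
    "Shared Memory Segments": "-m",
    "Semaphore Arrays": "-s",
    "Message Queues": "-q",
}


def stale_ipc_commands(ipcs_output, owner):
    """Return ipcrm command lines for IPC objects owned by `owner`."""
    commands = []
    flag = None
    for line in ipcs_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("------"):
            # e.g. "------ Semaphore Arrays --------" -> "Semaphore Arrays"
            flag = SECTION_FLAGS.get(stripped.strip("- "))
            continue
        fields = stripped.split()
        # Data rows start with a hex key; header and blank lines do not.
        if flag is None or len(fields) < 3 or not fields[0].startswith("0x"):
            continue
        ident, row_owner = fields[1], fields[2]
        if row_owner != owner:
            continue
        # Leave shared memory alone while a process is still attached.
        if flag == "-m" and len(fields) >= 6 and int(fields[5]) > 0:
            continue
        commands.append(f"ipcrm {flag} {ident}")
    return commands


if __name__ == "__main__":
    for command in stale_ipc_commands(SAMPLE_IPCS_OUTPUT, "hobbit"):
        print(command)
```

Printing the commands rather than executing them leaves room to check that the hobbit server really is stopped before anything is removed.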

The snapshot does not suffer from the left-over shared memory/semaphores
issue, but unfortunately it displays the description or comments as part
of the host name on bb2.html, which makes it unusable.

Here is my second issue:

Page.log:

2007-08-31 01:29:22 Stale alert for host08:ts-delay dropped
2007-08-31 01:29:22 Stale alert for hostf:cpu dropped
2007-08-31 01:29:22 Stale alert for host2:ocs dropped
2007-08-31 01:29:22 Stale alert for host06:procs dropped
2007-08-31 01:29:22 Stale alert for host08:tl1am-mir dropped
2007-08-31 01:29:22 Stale alert for host12:procs dropped
2007-08-31 01:29:22 Stale alert for host15:tl1am-osi dropped
2007-08-31 01:29:22 Stale alert for host16:tl1am-osi dropped
2007-08-31 01:29:22 Stale alert for host6:procs dropped
2007-08-31 01:29:22 Stale alert for host2:memory dropped
2007-08-31 01:29:22 Stale alert for host3:memory dropped
2007-08-31 01:29:22 Stale alert for ns:sins-out dropped
2007-08-31 01:29:22 Stale alert for hostxx:procsLSE dropped
2007-08-31 01:24:10 hobbitd_alert: Got message 49083, expected 49082
2007-08-31 01:24:17 hobbitd_alert: Got message 49085, expected 49084
2007-08-31 01:25:05 hobbitd_alert: Got message 49097, expected 49090
2007-08-31 01:25:22 hobbitd_alert: Got message 49104, expected 49103
2007-08-31 01:25:42 hobbitd_alert: Got message 49118, expected 49117
2007-08-31 01:26:17 hobbitd_alert: Got message 49129, expected 49126
2007-08-31 01:26:17 hobbitd_alert: Got message 49132, expected 49130
2007-08-31 01:27:04 Dropping (more) garbled data
.
.
.
2007-08-31 09:43:18 hobbitd_alert: Got message 6950, expected 6946
Done
2007-08-31 09:43:40 hobbitd_alert: Got message 6965, expected 6955
stty: : Invalid argument
stty: : Invalid argument
2007-08-31 10:38:21 hobbitd_alert: Got message 7693, expected 7692
stty: : Invalid argument
stty: : Invalid argument
stty: : Invalid argument
stty: : Invalid argument
stty: : Invalid argument
stty: : Invalid argument
.
.
.
2007-08-31 12:34:49 Stale alert for host3:ocs dropped
2007-08-31 12:34:50 Stale alert for host4:ocs dropped
2007-08-31 12:34:50 Stale alert for host5:ocs dropped
2007-08-31 12:34:50 Stale alert for host09:procs dropped
2007-08-31 12:34:50 Stale alert for host10:procs dropped
2007-08-31 12:34:50 Stale alert for host14:tl1am-osi dropped
2007-08-31 12:34:50 Stale alert for host15:tl1am-mir dropped
2007-08-31 12:34:50 Stale alert for host103:se dropped
2007-08-31 12:34:50 Stale alert for host17a:procs dropped

I think some of this may be caused by a server-side external script that
ssh's to a remote host, runs a restart script, and emails management,
developers and support that processes have been restarted.  Some of the
ssh'ing is captured in the page.log file.  Regardless, what is the deal
with these stale alerts?  I restarted hobbit 11 hours ago to dump them,
and have had to restart it again.  Unfortunately, these stale alerts are
causing unnecessary page-outs.
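
To get an overview of a log like this one, something along these lines could tally the entries. It is a rough sketch that assumes the two message formats seen in the excerpt above; summarize_page_log is a made-up name, and reading "got minus expected" as the number of skipped messages is an assumption about how hobbitd_alert numbers its messages.

```python
import re
from collections import Counter

# Patterns taken from the log lines quoted above.
STALE_RE = re.compile(r"Stale alert for ([^:\s]+):(\S+) dropped")
GAP_RE = re.compile(r"hobbitd_alert: Got message (\d+), expected (\d+)")


def summarize_page_log(lines):
    """Return (stale alerts per test, number of gap events, skipped messages)."""
    stale_by_test = Counter()
    gap_events = 0
    skipped = 0
    for line in lines:
        match = STALE_RE.search(line)
        if match:
            stale_by_test[match.group(2)] += 1
            continue
        match = GAP_RE.search(line)
        if match:
            got, expected = int(match.group(1)), int(match.group(2))
            gap_events += 1
            # Assumption: sequence numbers increase by one per message.
            skipped += got - expected
    return stale_by_test, gap_events, skipped


if __name__ == "__main__":
    sample = [
        "2007-08-31 01:29:22 Stale alert for host08:ts-delay dropped",
        "2007-08-31 01:29:22 Stale alert for host06:procs dropped",
        "2007-08-31 01:25:05 hobbitd_alert: Got message 49097, expected 49090",
        "stty: : Invalid argument",
    ]
    stale, gaps, missed = summarize_page_log(sample)
    print(dict(stale), gaps, missed)
```

Lines that match neither pattern, such as the stty errors, are ignored.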

Any hints on debugging would be appreciated.  I suspect the restart
script, but it is needed to keep 19 hosts healthy until they are patched
next week, so I cannot stop using it.  In any case, it is not related to
the shared memory and semaphores issue, because that was happening before
the restart script was installed.

I would even try our second hobbit server instance, but with the same
configuration it suffers from 'Whoops ! bb failed to send message -
timeout' every couple of hours.  That would also be resolved if I could
use the snapshot.


Output of ipcs -lm (the command shown near the top):

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 768000
max total shared memory (kbytes) = 8388608
min seg size (bytes) = 1
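
As a quick sanity check, the MAXMSG values can be compared against the segment size limit reported here. This is only a sketch: it assumes Hobbit sizes its shared memory segments from the MAXMSG_* settings, which the output above does not confirm, and the function names are made up for illustration.

```python
import re


def max_segment_kbytes(ipcs_lm_output):
    """Extract the 'max seg size (kbytes)' value from `ipcs -lm` output."""
    match = re.search(r"max seg size \(kbytes\) = (\d+)", ipcs_lm_output)
    return int(match.group(1)) if match else None


def oversized_settings(settings_kb, limit_kb):
    """Return the settings (in kB) that exceed the segment size limit."""
    return {name: kb for name, kb in settings_kb.items() if kb > limit_kb}


if __name__ == "__main__":
    limits = """\
max number of segments = 4096
max seg size (kbytes) = 768000
max total shared memory (kbytes) = 8388608
min seg size (bytes) = 1
"""
    # Values from the configuration quoted at the top of this message.
    settings = {"MAXMSG_STATUS": 2048, "MAXMSG_CLIENT": 4096, "MAXMSG_DATA": 2048}
    limit = max_segment_kbytes(limits)
    print(limit, oversized_settings(settings, limit))
```

With the numbers above, every setting is far below the 768000 kB limit, so under that assumption the kernel limit itself does not look like the constraint.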


David Gore (v965-3670) 
Network Management Systems (NMS) 
IMPACT Transport Team Lead - SCSA, SCNA 
Page: 1-800-PAG-eMCI pin 1406090 
Vnet: 965-3676
