[Xymon] xymongen hanging

Jeremy Laidman jeremy at laidman.org
Tue Oct 18 00:42:19 CEST 2022


Yep, the fact that the username is apache tells me that it wasn't initiated
by crontab or tasks.cfg, but instead by a user clicking on Reports >
Availability Report, and Reports > Snapshot Report, in the Xymon menu.

fd3 is a file with event history. The snapshot and availability reports
look through all of the history files to see any events that were present
at/during the report timeframe. So this is normal, unless it's stuck on the
same file for more than the briefest period. Did you run lsof on any other
processes to see what files were open on fd3? If it's the same file for all
of them, this might suggest a filesystem problem.
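
If it helps, here's a rough Python sketch of that check. It reads the fd symlinks under /proc for a list of PIDs and groups the PIDs by which file they have open. It assumes Linux /proc; the function name, the "<unreadable>" placeholder and the PID list are just for illustration and aren't part of Xymon. Run it as root or as apache so the links are readable.

```python
import os
from collections import defaultdict


def group_pids_by_fd_target(pids, fd=3, proc_root="/proc"):
    """Map each fd target path to the list of PIDs that have it open."""
    groups = defaultdict(list)
    for pid in pids:
        link = os.path.join(proc_root, str(pid), "fd", str(fd))
        try:
            target = os.readlink(link)
        except OSError:
            # Process exited, fd is closed, or we lack permission to look.
            target = "<unreadable>"
        groups[target].append(pid)
    return dict(groups)


if __name__ == "__main__":
    # A few PIDs from the "ps" listing in this thread; adjust as needed.
    pids = [14749, 14867, 15107, 15238]
    for target, owners in group_pids_by_fd_target(pids).items():
        print(f"{target}: {owners}")
```

Running it a few times, some seconds apart, should show whether each process moves on to other history files or stays on the same one.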

As these processes are owned by apache, it's worth taking a look at the
Apache logs around the time the processes were launched. You might be able
to get a more accurate start time from /proc/14749 than the output of "ps".
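
One way to do that, as a minimal sketch assuming Linux: field 22 of /proc/<pid>/stat is the start time in clock ticks since boot, and the "btime" line of /proc/stat gives the boot time in epoch seconds. The function names below are my own, and the script inspects its own PID only so that it runs anywhere; substitute 14749 or whichever process you're chasing.

```python
import os
import time


def process_start_epoch(stat_line, btime, ticks_per_sec):
    """Return the process start time as seconds since the Unix epoch.

    stat_line: contents of /proc/<pid>/stat
    btime: boot time in epoch seconds (the "btime" line of /proc/stat)
    ticks_per_sec: clock ticks per second, e.g. os.sysconf("SC_CLK_TCK")
    """
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split on the last ")" before counting fields.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state); starttime is field 22.
    start_ticks = int(after_comm[22 - 3])
    return btime + start_ticks / ticks_per_sec


def read_btime(proc_stat_text):
    """Extract the boot time from the text of /proc/stat."""
    for line in proc_stat_text.splitlines():
        if line.startswith("btime "):
            return int(line.split()[1])
    raise ValueError("no btime line found")


if __name__ == "__main__" and os.path.exists("/proc/self/stat"):
    pid = os.getpid()  # substitute the PID of interest
    with open(f"/proc/{pid}/stat") as f:
        stat_line = f.read()
    with open("/proc/stat") as f:
        btime = read_btime(f.read())
    started = process_start_epoch(stat_line, btime, os.sysconf("SC_CLK_TCK"))
    print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(started)))
```

The resulting timestamp can then be matched against entries in the Apache access log.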

The missing dollar sign is peculiar. But I wonder if that's just what "ps"
does. Or bash. What does the output of "strings /proc/14749/cmdline" look
like?
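
The reason for asking is that "ps" joins the arguments with spaces, whereas /proc/<pid>/cmdline keeps them as separate NUL-terminated strings, so you can see exactly what xymongen was given. A small Python sketch of the same idea follows; the function name is mine, and the sample bytes are reconstructed from the "ps" line you posted, not real cmdline output.

```python
def parse_cmdline(raw):
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into arguments."""
    # Arguments are separated by NUL bytes, usually with a trailing NUL.
    parts = raw.split(b"\0")
    if parts and parts[-1] == b"":
        parts.pop()
    return [p.decode("utf-8", errors="replace") for p in parts]


if __name__ == "__main__":
    # Illustrative input only, rebuilt from the "ps" listing in this thread.
    # On the real system, read it with:
    #     open("/proc/14749/cmdline", "rb").read()
    sample = (
        b"/xymon/server/server/bin/xymongen\0"
        b"--snapshot=2222867979\0"
        b"XYMONGENSNAPOPTS\0"
    )
    for i, arg in enumerate(parse_cmdline(sample)):
        print(i, repr(arg))
```

If the real output has the literal string XYMONGENSNAPOPTS as its own argument, the dollar sign was already missing when xymongen was launched.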

The $XYMONGENSNAPOPTS comes from the script snapshot.sh. Mine definitely
has a dollar sign in there.

J

On Tue, 18 Oct 2022 at 08:45, David Logan <David.Logan at nt.gov.au> wrote:

> Thanks Jeremy,
>
>
>
> Yes I saw that but I’m somewhat confused. In the tasks.cfg xymongen is set
> to run every minute (I think this is the distribution copy) and it probably
> does as our graphs are up to date. On Sunday am it starts about 17
> processes to do snapshots and reports. The crontabs are empty and I cannot
> find where these are started from. The biggest problem is they take massive
> amounts of cpu while sitting at the fread of fd 3. I cannot work out what
> is holding it up. The whole thing should be over in a matter of an hour or
> so but it can take up to 72 hrs to process the whole show.
>
>
>
> I can also see what is possibly an error in the process, as I don’t think
> there is a $ in front of a variable, and I’m wondering if this is the root
> cause.
>
>
>
> Thanks
>
> David
>
>
>
> *David Logan*
>
> *Senior Systems Administrator*
>
> *Data Centre Services*
>
> Department of *Corporate and Digital Development* *| *Northern Territory
> Government
> GPO Box 2391, Darwin, NT 0801,
> Australia
>
> *DCS Midrange Ticketing System*
>
> *p   ... <+61> 8 8999 6968 *
>
> *m …  <+61> 458 631 117            *New and Existing tickets:
> http://dcscentral.nt.gov.au/
>
> *e  ... **david.logan at nt.gov.au
> <david.logan at nt.gov.au>                                                *or
> dcs_service at nt.gov.au
>
> *w … www.nt.gov.au
> <http://www.nt.gov.au/>
>  **Escalations: (08) 8999 7654*
>
>
>
> *Our vision:* *improve government through services and solutions that
> exceed expectations*
>
> Our values: *Honest  **| **Professional*  *| Respectful  | **Accountable*
>   *| **Innovative *
>
> The information in this e-mail is intended solely for the addressee named.
> It may contain legally privileged or confidential information that is
> subject to copyright. If you are not the intended recipient you must not
> use, disclose copy or distribute this communication. If you have received
> this message in error, please delete the e-mail and notify the sender. No
> representation is made that this e-mail is free of viruses. Virus scanning
> is recommended and is the responsibility of the recipient.
>
> Please consider the environment before printing this email.
>
>
>
> *From:* Jeremy Laidman <jeremy at laidman.org>
> *Sent:* Monday, 17 October 2022 3:42 PM
> *To:* David Logan <David.Logan at nt.gov.au>
> *Cc:* xymon at xymon.com
> *Subject:* Re: [Xymon] xymongen hanging
>
>
>
> Hi David
>
>
>
> The "snapshot.cgi" runs from the web interface, and creates a snapshot
> report. The script snapshot.sh runs snapshot.cgi, and this in turn runs
> xymongen with "--snapshot=..." as an argument.
>
>
>
> Similarly, the "report.cgi" runs from the web interface, and creates an
> availability report, using "--reportopts=..." as an argument.
>
>
>
> Also, take a look at the xymonreports.sh script. At the top (of my copy)
> of this script there are instructions on creating a crontab entry to run
> the script so as to generate daily, weekly and monthly reports. These would
> generate xymongen processes with "--reportopts=..." as an argument.
>
>
>
> See "man snapshot" and "man report" for more info.
>
>
>
> Cheers
>
> Jeremy
>
>
>
> On Mon, 17 Oct 2022 at 15:57, David Logan <David.Logan at nt.gov.au> wrote:
>
> Hi Folks,
>
>
>
> Just wondering if anybody has any experience with xymongen hanging. I have
> a large number of xymongen processes being kicked off sometime over the
> weekend. Unfortunately they are owned by apache and have a PPID of 1, so I
> can’t tell how they were started. I’m presuming xymoncmd, but I can’t
> see anything in the crontab for xymon or in tasks.cfg that would kick off
> the snapshot and reporting processes.
>
>
>
> These then sit for a very long time (> 24hrs) while trying to read a data
> file from a specific server.
>
>
>
> apache   14749     1 44 Oct16 ?        10:28:39 /xymon/server/server/bin/xymongen --snapshot=2222867979 XYMONGENSNAPOPTS /xymon/server/server/www/snap/14748-1665896723
> apache   14867     1 43 Oct16 ?        10:26:32 /xymon/server/server/bin/xymongen --snapshot=2222867979 XYMONGENSNAPOPTS /xymon/server/server/www/snap/14866-1665896747
> apache   15107     1 43 Oct16 ?        10:26:05 /xymon/server/server/bin/xymongen --snapshot=2222867979 XYMONGENSNAPOPTS /xymon/server/server/www/snap/15106-1665896768
> apache   15118     1 43 Oct16 ?        10:25:58 /xymon/server/server/bin/xymongen --snapshot=2222867979 XYMONGENSNAPOPTS /xymon/server/server/www/snap/15117-1665896774
> apache   15125     1 43 Oct16 ?        10:25:12 /xymon/server/server/bin/xymongen --snapshot=2222867979 XYMONGENSNAPOPTS /xymon/server/server/www/snap/15124-1665896783
> apache   15238     1 43 Oct16 ?        10:23:26 /xymon/server/server/bin/xymongen --reportopts=2222871640:2222958039:1: /xymon/server/server/www/rep/15237-1665896797
> apache   15269     1 43 Oct16 ?        10:25:31 /xymon/server/server/bin/xymongen --snapshot=2222867979 XYMONGENSNAPOPTS /xymon/server/server/www/snap/15268-1665896804
> apache   15349     1 43 Oct16 ?        10:22:20 /xymon/server/server/bin/xymongen --snapshot=2222867979 XYMONGENSNAPOPTS /xymon/server/server/www/snap/15348-1665896807
> apache   15382     1 43 Oct16 ?        10:23:40 /xymon/server/server/bin/xymongen --reportopts=2222871640:2222958039:1: /xymon/server/server/www/rep/15381-1665896828
> apache   15398     1 43 Oct16 ?        10:25:13 /xymon/server/server/bin/xymongen --snapshot=2222867979 XYMONGENSNAPOPTS /xymon/server/server/www/snap/15397-1665896834
> apache   15400     1 43 Oct16 ?        10:22:59 /xymon/server/server/bin/xymongen --reportopts=2222871640:2222958039:1: /xymon/server/server/www/rep/15399-1665896837
> apache   15757     1 43 Oct16 ?        10:24:48 /xymon/server/server/bin/xymongen --reportopts=2222871640:2222958039:1: /xymon/server/server/www/rep/15756-1665896864
> apache   15842     1 43 Oct16 ?        10:22:32 /xymon/server/server/bin/xymongen --reportopts=2222871640:2222958039:1: /xymon/server/server/www/rep/15841-1665896873
> apache   15964     1 43 Oct16 ?        10:24:21 /xymon/server/server/bin/xymongen --reportopts=2222871640:2222958039:1: /xymon/server/server/www/rep/15963-1665896897
> apache   15996     1 43 Oct16 ?        10:22:25 /xymon/server/server/bin/xymongen --reportopts=2222871640:2222958039:1: /xymon/server/server/www/rep/15995-1665896912
> apache   16133     1 43 Oct16 ?        10:22:07 /xymon/server/server/bin/xymongen --reportopts=2222871640:2222958039:1: /xymon/server/server/www/rep/16132-1665896933
> apache   16149     1 43 Oct16 ?        10:23:37 /xymon/server/server/bin/xymongen --reportopts=2222871640:2222958039:1: /xymon/server/server/www/rep/16148-1665896954
> apache   16215     1 43 Oct16 ?        10:23:45 /xymon/server/server/bin/xymongen --reportopts=2222871640:2222958039:1: /xymon/server/server/www/rep/16214-1665896972
>
>
>
> An strace of the first PID is as follows (they are all the same); note the
> repeated reads on file descriptor 3:
>
>
>
> [root@dcslmonitor 15238]# strace -f -p 14749
> Process 14749 attached
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
> read(3, "", 4096)                       = 0
>
>
>
> fd3 is
>
>
>
> xymongen 14749 apache  cwd  DIR 253,0       6 134320195 /xymon/server/data/acks
> xymongen 14749 apache  rtd  DIR   8,2     269        64 /
> xymongen 14749 apache  txt  REG 253,0 1106256 135222190 /xymon/server/server/bin/xymongen
> xymongen 14749 apache  mem  REG   8,6  155784   4448319 /usr/lib64/libselinux.so.1
> xymongen 14749 apache  mem  REG   8,6  109976   4873245 /usr/lib64/libresolv-2.17.so
> xymongen 14749 apache  mem  REG   8,6   15688   4259351 /usr/lib64/libkeyutils.so.1.5
> xymongen 14749 apache  mem  REG   8,6   67104   4471490 /usr/lib64/libkrb5support.so.0.1
> xymongen 14749 apache  mem  REG   8,6  142144   4873243 /usr/lib64/libpthread-2.17.so
> xymongen 14749 apache  mem  REG   8,6   90632   4195838 /usr/lib64/libz.so.1.2.7
> xymongen 14749 apache  mem  REG   8,6   19248   4358022 /usr/lib64/libdl-2.17.so
> xymongen 14749 apache  mem  REG   8,6  210824   4471445 /usr/lib64/libk5crypto.so.3.1
> xymongen 14749 apache  mem  REG   8,6   15920   4939663 /usr/lib64/libcom_err.so.2.1
> xymongen 14749 apache  mem  REG   8,6  967840   4259800 /usr/lib64/libkrb5.so.3.3
> xymongen 14749 apache  mem  REG   8,6  320400   4256684 /usr/lib64/libgssapi_krb5.so.2.2
> xymongen 14749 apache  mem  REG   8,6 2156272   4262067 /usr/lib64/libc-2.17.so
> xymongen 14749 apache  mem  REG   8,6  402384   4259730 /usr/lib64/libpcre.so.1.2.0
> xymongen 14749 apache  mem  REG   8,6 2521008   4256674 /usr/lib64/libcrypto.so.1.0.2k
> xymongen 14749 apache  mem  REG   8,6  470360   4195836 /usr/lib64/libssl.so.1.0.2k
> xymongen 14749 apache  mem  REG   8,6  163312   4448246 /usr/lib64/ld-2.17.so
> xymongen 14749 apache   0r FIFO   0,8     0t0 404824379 pipe
> xymongen 14749 apache   1w FIFO   0,8     0t0 404824380 pipe
> xymongen 14749 apache   2w FIFO   0,8     0t0 404824381 pipe
> xymongen 14749 apache   3r  REG 253,0     524  67195718 /xymon/server/data/hist/accessntg.sslcert
>
>
>
> Every process in the list above has the same file open as fd3. Are they
> locking each other out or, more to the point, should they be?
>
>
>
> Any ideas on where to look or what to do next?
>
>
>
> Thanks
>
>
>
> *David Logan*
>
>
>