
Re: [hobbit] RRD crashing high availability hobbit



j.sansford (at) ntlworld.com wrote:
> Hi Buchan,
>
> We get a core dump, running a pstack gives the following info:
>
> core 'core' of 11142:   hobbitd_rrd --rrddir=/export/home/hobbit/data/rrd
>  fed28a17 _lwp_kill (1, 6) + 7
>  fecd1d63 raise    (6) + 1f
>  fecb1bad abort    (806fe88, fecd55f6, 8768eb0, 806a6ca, fed901c0, 0) + cd
>  08060291 xstrdup  (0, 806a6ca, 87d9d1c, 8081cc0, 84ed451, 0) + 31
>  0805bf7c do_netapp_extratest_rrd (84ec4ff, 806af10, 84ec8fa, 4a8b1bbf, 8081a00, 8081cc0) + 200
>  0805c1c9 do_netapp_extrastats_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 4a8b1bbf) + e1
>  0805e0ea update_rrd (84ec4ff, 84ec509, 84ec511, 4a8b1bbf, 84ec4f4, 0) + 7d6
>  08054044 main     (2, 804613c, 8046148) + 4dc
>  080539fc _start   (2, 8046484, 8046490, 0, 80464b6, 80464f6) + 80
>
>   
That looks like the extratest handler for a netapp, which, from what I
can see in hobbitd/do_rrd.c, is what handles the xtstats column reported
by netapp.pl. That is just from a cursory glance at the code, as I don't
use it myself. You really need to look at the C code to check that it's
doing the right thing. You have 2 choices: the quick fix is to disable
just that test in netapp.pl; the other option is to work out what format
the data should be in and fix the test.

In 4.2.3, for example, the do_devmon.c RRD code doesn't actually
implement what is documented, and I use a Perl script with --extra-script
instead.

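Various RRD handlers are in hobbitd/rrd/do_*.c.
Looking at the code for xstrdup in lib/memory.c (below), you should
check your logs. It's probably being called with a NULL pointer (it's
unlikely that you're out of memory), and the first argument to xstrdup
in your pstack output is 0, which is consistent with that. Either way,
the logs should tell you which case it is.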

char *xstrdup(const char *s)
{
        char *result;

        if (s == NULL) {
                errprintf("xstrdup: Cannot dup NULL string\n");
                abort();
        }

        result = strdup(s);
        if (result == NULL) {
                errprintf("xstrdup: Out of memory\n");
                abort();
        }

#ifdef MEMORY_DEBUG
        add_to_memlist(result, strlen(result)+1);
#endif

        return result;
}
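
If the handler really is passing NULL, one way to avoid the abort() is
for the caller to check for a missing field before duplicating it. This
is only a sketch: dup_or_default is a hypothetical helper that I made up
for illustration, it is not in the Hobbit source, and I have not checked
what do_netapp_extratest_rrd actually parses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Hobbit): duplicate s, or fall back to
 * a caller-supplied default when s is NULL, instead of aborting the way
 * xstrdup() does. Returns NULL only if the allocation itself fails.
 */
char *dup_or_default(const char *s, const char *fallback)
{
        const char *src;
        char *result;

        assert(fallback != NULL);       /* the fallback must be a real string */

        src = (s != NULL) ? s : fallback;
        result = malloc(strlen(src) + 1);
        if (result != NULL) strcpy(result, src);

        return result;
}
```

Whether a default value or simply skipping the record is the right
behaviour depends on what the test is supposed to report, so treat this
as a starting point rather than a patch.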
> Note that as of 5.30pm today rrd-status.log is 127MB, full of errors
> spanning 607625 lines (this is just for today; we roll the logs each
> night). This seems abnormally large to me and I think this is
> eventually crashing the server.
>
> Hope this helps. I will try and take a deeper look at the logs next time it happens...it seems to happen around once or twice a week.
>
> Cheers
> James.
>
> ---- Buchan Milne <bgmilne (at) staff.telkomsa.net> wrote: 
>   
>> On Thursday, 20 August 2009 11:06:30 j.sansford (at) ntlworld.com wrote:
>>     
>>> Hi again all,
>>>
>>> I need some help configuring/debugging why our hobbit servers are crashing
>>> (due to rrd, which I shall explain shortly) and how to get around this. We
>>> have 3 hobbit servers with proxies, however I will simplify this
>>> explanation with just 2 hobbits and no proxies (as we discovered the same
>>> thing happens).
>>>
>>> Detail of theoretical setup:
>>>
>>> 1) 2 datacentres. Each datacentre contains a single hobbit server instance.
>>> 2) Each client reports to their local datacentre hobbit server.
>>> 3) Each hobbit server is configured such that they know about the other
>>> hobbit (through BBDISPLAYS).
>>>
>>>
>>> The issue is that for what looks like most server-side tests, such as
>>> vmstat, we are getting feedback loops between the hobbit servers.
>>>
>>> For instance: A hobbit server in DC1 tests a client in DC1 using vmstat.
>>> The client reports back to hobbit in DC1 and hobbit then also reports this
>>> data to the hobbit in DC2. The hobbit in DC2 however is configured to
>>> report to DC1 and so bounces the message back (I think). Therefore the
>>> server tries to update the rrd twice within a second resulting in errors.
>>> Eventually this will crash the server.
>>>       
>> How did you determine that this is what is "crashing" the server?
>>
>>     
>>> An example of the rrd error
>>> messages:
>>>
>>> 2009-08-20 11:04:04 RRD error updating
>>> /export/home/hobbit/data/rrd/h3-avm-dbx/ifstat.mac.rrd from 10.6.60.1:
>>> illegal attempt to update using time 1250762644 when last update time is
>>> 1250762644 (minimum one second step)
>>> 2009-08-20 11:04:06 RRD error updating
>>> /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
>>> illegal attempt to update using time 1250762646 when last update time is
>>> 1250762646 (minimum one second step)
>>> 2009-08-20 11:04:06 RRD error updating
>>> /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
>>> illegal attempt to update using time 1250762646 when last update time is
>>> 1250762646 (minimum one second step)
>>> 2009-08-20 11:04:06 RRD error updating
>>> /export/home/hobbit/data/rrd/h2-emu13/ifstat.mac.rrd from 10.6.60.1:
>>> illegal attempt to update using time 1250762646 when last update time is
>>> 1250762646 (minimum one second step)
>>>       
>> I have a number of setups where messages like this are common, due to running 
>> network tests and SNMP polling at intervals smaller than 5 minutes (without 
>> adjusting all the RRD files to cater to this), and I have not seen hobbit 
>> "crash" due to this.
>>
>>     
These kinds of messages can also be due to duplicate keys being used in
RRD reporting. You need to look at how the RRD data is generated to get
to the bottom of these. Sometimes the duplicates are in one test,
sometimes multiple tests report the same thing, and sometimes a test
reports too frequently (such as your possible loops). It is unlikely
that this will crash hobbitd_rrd, though.
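
To make the "minimum one second step" rule concrete, here is a small
sketch of the check that produces those messages: a second update for
the same RRD file with the same (or an older) timestamp is rejected.
The names accept_update, struct last_update and the fixed-size table are
my own assumptions for illustration; this is not how rrdtool or
hobbitd_rrd implement it.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

#define MAX_KEYS   64
#define MAX_KEYLEN 128

/* Hypothetical bookkeeping: last timestamp accepted for each RRD file. */
struct last_update {
        char key[MAX_KEYLEN];
        time_t when;
};

static struct last_update seen[MAX_KEYS];
static int seen_count = 0;

/*
 * Return 1 if an update for key at time now would be accepted under a
 * "minimum one second step" rule, 0 if it would be rejected because the
 * timestamp is not newer than the last one, the table is full, or the
 * key is too long to store.
 */
int accept_update(const char *key, time_t now)
{
        int i;

        assert(key != NULL);

        for (i = 0; i < seen_count; i++) {
                if (strcmp(seen[i].key, key) == 0) {
                        if (now <= seen[i].when) return 0;  /* duplicate */
                        seen[i].when = now;
                        return 1;
                }
        }

        if (seen_count == MAX_KEYS || strlen(key) >= MAX_KEYLEN) return 0;

        strcpy(seen[seen_count].key, key);
        seen[seen_count].when = now;
        seen_count++;
        return 1;
}
```

With the h2-emu13 entries from your log, the first update at 1250762646
would be accepted and every further one in that same second rejected,
which is why a duplicated key or a loop shows up as a run of identical
error lines.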

For example, I had this on Mac OS X for ifstat. By default it uses
'netstat -ibn', which produces multiple lines for the same interface.
I changed that in hobbitclient-darwin.sh to 'netstat -ibn | egrep -v
"^lo|^vmnet|<Link"'. Note that I had to filter out the vmnet interfaces,
since netstat -i limits the interface name to 5 chars and there are
actually both vmnet1 and vmnet8 :( Luckily I don't really care about
those.

bash-3.2# netstat -ibn | egrep -v "^lo|^vmnet|<Link"
Name  Mtu   Network       Address            Ipkts Ierrs     Ibytes    Opkts Oerrs     Obytes  Coll
en0   1500  fe80::21f:f fe80:6::21f:f3ff:  7709215     - 2307390372 23616260     - 32787591390     -
en0   1500  10.1/16       10.1.75.6        7709215     - 2307390372 23616260     - 32787591390     -
en2   1500  fe80::201:2 fe80:9::201:23ff:        0     -          0        0     -     781938     -
en2   1500  10.37.129/24  10.37.129.2            0     -          0        0     -     781938     -
en3   1500  fe80::210:3 fe80:a::210:32ff:        0     -          0        0     -     792748     -
en3   1500  10.211.55/24  10.211.55.2            0     -          0        0     -     792748     -
bash-3.2# netstat -ibn         
Name  Mtu   Network       Address            Ipkts Ierrs     Ibytes    Opkts Oerrs     Obytes  Coll
lo0   16384 <Link#1>                        196623     0   20477947   196620     0   20477947     0
lo0   16384 fe80::1%lo0 fe80:1::1           196623     -   20477947   196620     -   20477947     -
lo0   16384 127           127.0.0.1         196623     -   20477947   196620     -   20477947     -
lo0   16384 ::1/128     ::1                 196623     -   20477947   196620     -   20477947     -
gif0* 1280  <Link#2>                             0     0          0        0     0          0     0
stf0* 1280  <Link#3>                             0     0          0        0     0          0     0
en1   1500  <Link#4>    00:1f:5b:c3:ec:35        0     0          0        0     0          0     0
fw0   4078  <Link#5>    00:1f:f3:ff:fe:71:5e:18        0     0          0        0     0        346     0
en0   1500  <Link#6>    00:1f:f3:5c:32:e6  7709242     0 2307393391 23616262     0 32787591586     0
en0   1500  fe80::21f:f fe80:6::21f:f3ff:  7709242     - 2307393391 23616262     - 32787591586     -
en0   1500  10.1/16       10.1.75.6        7709242     - 2307393391 23616262     - 32787591586     -
vmnet 1500  <Link#7>    00:50:56:c0:00:08        0     0          0        0     0          0     0
vmnet 1500  192.168.149   192.168.149.1          0     -          0        0     -          0     -
vmnet 1500  <Link#8>    00:50:56:c0:00:01        0     0          0        0     0          0     0
vmnet 1500  172.16.189/24 172.16.189.1           0     -          0        0     -          0     -
en2   1500  <Link#9>    00:01:23:45:67:89        0     0          0        0     0     781938     0
en2   1500  fe80::201:2 fe80:9::201:23ff:        0     -          0        0     -     781938     -
en2   1500  10.37.129/24  10.37.129.2            0     -          0        0     -     781938     -
en3   1500  <Link#10>   00:10:32:54:76:98        0     0          0        0     0     792748     0
en3   1500  fe80::210:3 fe80:a::210:32ff:        0     -          0        0     -     792748     -
en3   1500  10.211.55/24  10.211.55.2            0     -          0        0     -     792748     -
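
For anyone who would rather do this filtering in code than in the shell,
the same rule as the egrep -v pattern above can be sketched in C. The
function name keep_netstat_line is hypothetical, and this is just an
illustration of the pattern, not something the Hobbit client does.

```c
#include <assert.h>
#include <string.h>

/*
 * Return 1 if a "netstat -ibn" line should be kept, 0 if it should be
 * dropped. Mirrors egrep -v "^lo|^vmnet|<Link": drop lines that start
 * with "lo" or "vmnet", and lines that contain "<Link" anywhere.
 */
int keep_netstat_line(const char *line)
{
        assert(line != NULL);

        if (strncmp(line, "lo", 2) == 0) return 0;
        if (strncmp(line, "vmnet", 5) == 0) return 0;
        if (strstr(line, "<Link") != NULL) return 0;

        return 1;
}
```

Like the egrep version, this keeps the header line, since it matches none
of the three patterns.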

>> What is the behaviour you see when it "crashes the server" ? Does hobbitd_rrd 
>> die and leave a status message? Or, does something else occur? Does the server 
>> reboot? Does the OS hang? How often does this occur?
>>
>>     
>>> My question is - how can we stop this happening?
>>>       
>> You would first need to tell us what is happening ...
>>
>>     
>>> Also, why is this
>>> happening? Is there a way we can disable rrd graphing on one server so just
>>> one hobbit server handles the graphing?
>>>
>>> I hope that makes sense. If you need further clarification please let me
>>> know.
>>>       
>> If hobbitd or hobbitd_rrd or some other process actually crashes, you should 
>> be able to get a core file, from which you can get a backtrace (e.g. with gdb), 
>> which would allow someone to see why it is crashing, and possibly fix it.
>>
>> Regards,
>> Buchan
>>     


-- 
David Baldwin - IT Unit
Australian Sports Commission          www.ausport.gov.au
Tel 02 62147830 Fax 02 62141830       PO Box 176 Belconnen ACT 2616
david.baldwin (at) ausport.gov.au          Leverrier Street Bruce ACT 2617

