Mailing List Archive


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[tlug] Bogus load data



Today I came across a host that was reporting bogus load data:

: jglover@example.com; top
00:39:48  up 24 days,  5:16,  8 users,  load average: 71.25, 71.17, 71.03
720 processes: 719 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
          total    2.8%    0.8%    1.1%   0.0%     0.2%    0.1%   94.5%
          cpu00    5.1%    0.9%    1.5%   0.0%     0.5%    0.1%   91.4%
          cpu01    0.5%    0.7%    0.7%   0.0%     0.0%    0.1%   97.6%
Mem:  8209748k av, 8085384k used,  124364k free,       0k shrd,  195128k buff
                  3144408k actv, 3564156k in_d,  129840k in_c
Swap: 2040212k av,   50420k used, 1989792k free                 5901868k cached

 PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
8171 xxxxxx    15   0 47120  46M 12972 S     1.4  0.5   0:13   0 XxXxXxXxXxX
16297 xxxxxx    25  10 30672  29M  3284 S N   0.6  0.3   0:36   0 perl
14272 xxxxxx     0 -20 42288  40M  1004 S <   0.4  0.5 149:51   0 rvd
[...]

The root cause turned out to be:

: jglover@example.com; tail /var/log/kern.log
Sep 32 25:20:02 xxxxxxxxxxx kernel: nfs: server yyyyyyyyyyyyyyy not
responding, still trying
Sep 32 25:20:02 xxxxxxxxxxx kernel: nfs: server yyyyyyyyyyyyyyy not
responding, still trying

But while I was trying to figure it out, I whipped up a C program to
test the system response to see if the load was really off. You can
check it out at:

http://www.rafb.net/paste/results/19YhpB15.html

I will also include it inline so people can comment:


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include <sys/time.h>

int main(void) {
	struct timeval tv_start, tv_stop;
   unsigned int usec = 20000; // fastest response that I've seen on Linux

	while (1) {
		if (gettimeofday(&tv_start, NULL) == -1) {
			fprintf(stderr, "gettimeofday() failed to set tv_start\n");
			exit(1);
		}
		if (usleep(usec) == -1) {
			fprintf(stderr, "usleep() failed to sleep for %u microseconds\n",
				usec);
			exit(2);
		}
		if (gettimeofday(&tv_stop, NULL) == -1) {
			fprintf(stderr, "gettimeofday() failed to set tv_stop\n");
			exit(1);
		}

		if (tv_stop.tv_usec < tv_start.tv_usec) tv_stop.tv_usec += 1000000;
		fprintf(stderr, "Tried to sleep for %u microseconds, actually slept "
			"for %u microseconds\n", usec,
			(unsigned int)(tv_stop.tv_usec - tv_start.tv_usec));

		usec += 20000;

		if (usec == 1000000) break;

       sleep(1);
	}

	return 0;
} // main()


So, thoughts? This is extremely unscientific, but do you think it
might give some indication of a problem on a system that is *really*
heavily loaded? My idea was that if the system was busy, this would
report trying to sleep for a short time, but actually sleeping for a
much longer one, due to slow context switches. The "sleep(1);" line at
the end of the main loop is to ensure at least one context switch
occurs between usleep() syscalls.

From test runs, I found data like the following on an unloaded machine:

: jglover@example.com; gcc -Wall -o usleep usleep.c
: jglover@example.com; ./usleep
Tried to sleep for 20000 microseconds, actually slept for 27528 microseconds
Tried to sleep for 40000 microseconds, actually slept for 50047 microseconds
Tried to sleep for 60000 microseconds, actually slept for 69989 microseconds
Tried to sleep for 80000 microseconds, actually slept for 90004 microseconds
Tried to sleep for 100000 microseconds, actually slept for 109777 microseconds
[...]
Tried to sleep for 500000 microseconds, actually slept for 510023 microseconds
Tried to sleep for 520000 microseconds, actually slept for 530119 microseconds
Tried to sleep for 540000 microseconds, actually slept for 550026 microseconds
Tried to sleep for 560000 microseconds, actually slept for 570025 microseconds
Tried to sleep for 580000 microseconds, actually slept for 590027 microseconds
[...]
Tried to sleep for 900000 microseconds, actually slept for 910042 microseconds
Tried to sleep for 920000 microseconds, actually slept for 930042 microseconds
Tried to sleep for 940000 microseconds, actually slept for 950062 microseconds
Tried to sleep for 960000 microseconds, actually slept for 970053 microseconds
Tried to sleep for 980000 microseconds, actually slept for 990054 microseconds

Cheers,
Josh


Home | Main Index | Thread Index

Home Page Mailing List Linux and Japan TLUG Members Links