You might be familiar with Linux load averages already. Load averages are the three numbers shown with the uptime and top commands – they look like this:
load average: 0.10 (1min), 0.08 (5min), 0.01 (15min)
Simply put, each value is the number of blocking processes in the run queue, averaged over that time window. The average is computed as an exponentially damped/weighted moving average of the instantaneous load number (each process using or waiting for a CPU increments the load number by 1), but the exact calculation is out of the scope of this post.
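The kernel exposes these same three averages in /proc/loadavg, which is where uptime and top read them from. A minimal sketch of grabbing them in a script:

```shell
# Read the three load averages straight from the kernel.
# /proc/loadavg fields: 1min 5min 15min running/total last-pid
read one five fifteen rest < /proc/loadavg
echo "1min=$one 5min=$five 15min=$fifteen"
```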
What is a blocking process?
A blocking process is a process that is waiting for something before it can continue. Typically, it is waiting for:
- Disk I/O
- Network I/O
Most UNIX systems count only processes in the running (on CPU) or runnable (waiting for CPU) states. However, Linux also includes processes in uninterruptible sleep states (usually waiting for disk activity), which can lead to markedly different results if many processes remain blocked in I/O due to a busy or stalled I/O system. This includes, for example, processes blocked by an NFS server failure or by slow media (e.g., USB 1.x storage devices). Such circumstances can result in an elevated load average that does not reflect an actual increase in CPU use (but still gives an idea of how long users have to wait).
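A quick way to see how many processes are currently in the two states Linux folds into the load number is to group the output of ps by scheduler state:

```shell
# Count processes by state: R = running/runnable,
# D = uninterruptible sleep (usually blocked on disk I/O).
ps -eo stat= | awk '
  /^R/ { r++ }
  /^D/ { d++ }
  END  { printf "runnable: %d, uninterruptible: %d\n", r+0, d+0 }'
```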
What constitutes “good” and “bad” load average values?
It depends on the number of physical CPUs / CPU cores of your server. For example:
- On a single-processor/single-core system, a load of 1.00 means 100% CPU utilization.
- On a dual-core box, a load of 2.00 means 100% CPU utilization.
Which leads us to a Rule of Thumb:
- The “number of cores = max load” Rule of Thumb: on a multicore system, your load should not exceed the number of cores available.
Based on experience, I would say that 0.70 is the common threshold for deciding whether a system might be overloaded or might have some kind of I/O problem. If your load average stays above 0.70, it’s time to investigate before things get worse. This is valid, of course, for single-processor boxes.
Which average should I be observing? One, five, or 15 minute?
You should definitely be looking at the five or 15-minute averages. Frankly, if your box spikes above 1.0 on the one-minute average, you’re still fine. It’s when the 15-minute average goes north of 1.0 and stays there that you need to snap to. (Obviously, as we’ve learned, adjust these numbers to the number of processor cores your system has.)
So, once we know that our system is overloaded, we must take a look at specific values with the top command to dig further into what is causing the high load:
The CPU usage figure (the us and sy values in top’s summary line) represents what percentage of its total time the CPU spends processing work. If this number is constantly around 99–100%, chances are the problem is CPU-related: either the machine is under-powered, or a process is hogging the CPU.
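You can grab that figure non-interactively with top’s batch mode; the %Cpu(s) summary line shows us (user), sy (system), id (idle), and wa (I/O wait):

```shell
# One batch-mode sample of top's summary area; look at the
# %Cpu(s) line for us/sy (real work) vs. id (idle) and wa (I/O wait).
top -bn1 | grep -i 'cpu(s)'
```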
The I/O wait figure (wa in top) indicates whether the CPU is waiting on I/O. If this number is high (above 80% or so), you have a problem: the CPU is spending most of its time waiting for I/O to complete. This could mean that you have a failing hard disk or a failing network card, or that your applications are trying to access data on either of them at a rate significantly higher than the throughput they are designed for.
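The same counters behind top’s wa figure live in /proc/stat, where the fifth value on the cpu line is cumulative iowait time in jiffies since boot. A rough sketch of the since-boot percentage (this ignores a few minor counters such as steal time, so it is an approximation):

```shell
# Aggregate cpu line in /proc/stat:
# cpu user nice system idle iowait irq softirq steal ...
read -r label user nice system idle iowait irq softirq steal rest < /proc/stat
total=$((user + nice + system + idle + iowait + irq + softirq + steal))
awk -v w="$iowait" -v t="$total" \
  'BEGIN { printf "iowait since boot: %.1f%%\n", 100 * w / t }'
```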
To find out what applications are causing the load, run the command ps faux. This will list every process running on your system, and the state it is in.
You want to look in the STAT column. The common flags that you should be looking for are:
- R – Running
- S – Sleeping
- D – Uninterruptible sleep (waiting for something, usually disk I/O)
So, look for any processes with a STAT of D, and you can go from there to diagnose the problem.
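Assuming a Linux ps, that last step can be condensed into a one-liner that lists only the processes currently stuck in uninterruptible sleep:

```shell
# Show PID, state, and command for processes whose STAT begins
# with D - the usual suspects behind an I/O-driven load average.
ps -eo pid=,stat=,comm= | awk '$2 ~ /^D/ { print }'
```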