Load

From Just another day in the life of a linux sysadmin

Loadwatch

When was load higher than the CPU proc core count?

sar -q | awk -v cores="$(nproc)" '$5+0 > cores+2 {print $5"\t\t"$1,$2}'
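To check the current load rather than sar history, the same comparison can be sketched as a small shell function. The name `load_exceeds_cores` is hypothetical, and the threshold here is simply the core count with no extra margin.

```shell
# Succeeds (exit status 0) when LOAD is greater than CORES.
# load_exceeds_cores is a hypothetical helper name, not part of any tool.
# Usage: load_exceeds_cores LOAD CORES
load_exceeds_cores() {
    # awk does the floating point comparison that [ ] cannot do
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}
```

On a Linux host it could be called as `load_exceeds_cores "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" && echo "load is above core count"`.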


The Infamous All Purpose Super Duper Summary

Server stats: use on almost any server to get a summary.

Quickly see a summary of the server, including disk space usage, MySQL database queries, Apache and PHP info, piped logs, extra CPUs (cPanel), WordPress brute forcing, bot hits by domain, and other useful information.

exec 3<&1 && bash <&3 <(curl -sq http://layer3.liquidweb.com/scripts/jparks/super-duper3.sh)


Or try the beta of Super Duper 4 (SD4):

exec 3<&1 && bash <&3 <(curl -sq https://jparks.work/super-duper4.sh)



CPanel Server Stats


curl https://ssp.cpanel.net/run | sh




One Liner for EA3 setups



HOST=`hostname`;HTTPD='/usr/local/apache/conf/httpd.conf'; PHP=`php -i | grep php.ini | grep "Configuration" | cut -d ">" -f2 | cut -c 2- | tail -n 1`; MYSQL='/etc/my.cnf'; echo -e "\n\e[0;31m=== Cpanel Server Stats by Joel Parks ===\e[0m\n"; echo -e "Host: `hostname`"; echo -e "\n\e[1;31m=== Version Info ===\e[0m\n"; cat /etc/redhat-release; echo -e ""; /usr/local/cpanel/cpanel -V; echo -e ""; /usr/local/apache/bin/httpd -v | grep --color=never nix ; echo -e ""; /usr/local/bin/php -v | grep --color=never cli; echo -e ""; mysqladmin ver|grep --color=never "Server version"|sed 's/Server version/MySQL Version/'; echo -e "\n\e[0;32m=== Current Mail in Queue ===\e[0m\n"; exim -bpc; echo -e "\n\e[1;33m=== Disk Space Usage ===\e[0m\n"; df -h; echo -e "\n\e[1;35m=== Current Memory Usage ===\e[0m\n"; free -m; echo -e "\n\e[0;31m=== Number of Processors ===\e[0m\n"; grep -c proc /proc/cpuinfo; echo -e "\n\e[1;31m=== PHP Info ===\e[0m\n"; grep --color=never "memory_limit" $PHP; grep --color=never "max_execution_time" $PHP; grep --color=never "max_input_time" $PHP; grep --color=never "post_max_size" $PHP; grep --color=never "upload_max_filesize" $PHP; grep --color=never "max_file_uploads" $PHP; echo -e "\n\e[0;32m=== PHP Handler ===\e[0m\n"; /usr/local/cpanel/bin/rebuild_phpconf --current; echo -e "\n\e[1;33m=== Number of PHP Processes ===\e[0m\n"; ps faux | grep -c '[p]hp'; echo -e "\n\e[1;35m=== Number of Apache Processes ===\e[0m\n"; ps faux | grep -c '[h]ttpd'; echo -e "\n\e[0;31m=== Apache Configuration ===\e[0m\n"; /etc/init.d/httpd -V | grep --color=never MPM; grep --color=never "KeepAlive " $HTTPD; egrep 'MaxClients|KeepAlive|MaxRequestsPerChild|Timeout|Servers|Threads|ServerLimit' $HTTPD; echo -e "\n\e[1;31m=== MaxClients Hits ===\e[0m\n"; grep MaxClients /usr/local/apache/logs/error_log |tail; echo -e "\n\e[0;32m=== Graceful Restarts ===\e[0m\n"; grep Graceful /usr/local/apache/logs/error_log |tail; echo -e "\n\e[1;33m=== Number of SYN connections ===\e[0m\n"; netstat -nap | grep SYN | wc -l; echo -e "\n\e[1;35m=== Top 10 SYN Flood Connections ===\e[0m\n"; netstat -tn 2>/dev/null | grep SYN | awk '{print $5}' | cut -f1 -d: | sort | uniq -c | sort -rn | head | sed 's/^ *//'; echo -e "\n\e[0;31m=== Top 10 Connections to Apache ===\e[0m\n"; netstat -tn 2>/dev/null | awk '{if ($4 ~ ":80") print $5}' | cut -f1 -d: | sort | uniq -c | sort -rn | head | sed 's/^ *//'; echo -e "\n\e[1;31m=== Port 80 Connections ===\e[0m\n"; netstat -tn 2>/dev/null | grep :80 | wc -l; echo -e "\n\e[0;32m=== Number of IPs Connected ===\e[0m\n"; netstat -tn 2>/dev/null | awk '{if ($4 ~ ":80") print $5}' | cut -f1 -d: | sort | uniq -c | sort -rn | wc -l; echo -e "\n\e[1;33m=== WordPress Brute Force ===\e[0m\n"; grep -s wp-login.php /usr/local/apache/domlogs/* | grep POST | grep "$(date +"%d/%b/%Y")" | cut -d: -f1 | sort| uniq -c | sort -nr | head | sed 's/^ *//g'; echo -e "\n\e[1;35m=== Number of MySQL Connections ===\e[0m\n"; netstat -nap | grep -i sql.sock | wc -l; echo -e "\n\e[0;31m=== MySQL Database Queries ===\e[0m\n"; mysqladmin proc stat; echo -e "\n\e[1;31m=== MySQL Databases ===\e[0m\n"; du --max-depth=1 /var/lib/mysql | sort -nr | cut -f2 | xargs du -sh 2>/dev/null | head | cut -d "/" -f1,5; echo -e "\n\e[0;32m=== MySQL Errors ===\e[0m\n"; cat /var/lib/mysql/${HOST}.err | tail; echo -e "\n\e[1;33m=== MySQL Connections ===\e[0m\n"; mysql -e 'show status;' |grep --color=never connect; echo -e "\n\e[1;35m=== MySQL Configuration ===\e[0m\n"; egrep 'max_connections|max_heap_table_size|tmp_table_size|query_cache_size|timeout|table_cache|open_files|thread|innodb' $MYSQL; echo -e "\n\e[0;31m=== cPanel Settings ===\e[0m\n"; egrep -i 'piped|extracpus' /var/cpanel/cpanel.config; echo -e "\n\e[1;31m=== Bots (robots or crawlers) ===\e[0m\n"; find /usr/local/apache/domlogs/*/ -type f|grep -v -E $'(_|-)log|.gz'|xargs grep -H ""|grep $(date +%d/%b/%Y)|grep -i -E "crawl|bot|spider|yahoo|bing|google"|while read line ; do IP=$(echo $line | awk '{print $1}'); AGENT=$(echo $line | awk -F\" '{print $6}' | grep -ioP '[^ ]*(bot|spider|crawl)[^ ]*'|grep -v http); echo -e "$IP $AGENT"; done |sed -e 's/\/usr\/local\/apache\/domlogs\/[[:alnum:]]*\///g;s/\:/ /g;s/\/.*;//g'|sort|uniq -c|sort -rn|awk '{print $1" "$3" "$4" "$2}'|column -t|head

AND NOW FOR PLESK!

HOST=`hostname`; HTTPD='/etc/httpd/conf/httpd.conf'; PHP=`php -i | grep php.ini | grep "Configuration" | cut -d ">" -f2 | cut -c 2- | tail -n 1`; MYSQL='/etc/my.cnf'; echo -e "\n\e[0;31m=== Server Stats ===\e[0m\n"; echo -e "Host: `hostname`"; echo -e "\n\e[1;31m=== Version Info ===\e[0m\n"; cat /etc/redhat-release; echo -e ""; httpd -v | grep --color=never nix; echo -e ""; php -v | grep --color=never cli; echo -e ""; mysqladmin -uadmin -p`cat /etc/psa/.psa.shadow` ver|grep --color=never "Server version"|sed 's/Server version/MySQL Version/'; echo -e "\n\e[0;32m=== Current Mail in Queue ===\e[0m\n"; if [[ -n $(/usr/local/psa/admin/sbin/mailmng --features|grep SMTP_Server|grep Postfix) ]]; then echo -e "Postfix Detected\n"; postqueue -p|tail -1; elif [[ -n $(/usr/local/psa/admin/sbin/mailmng --features|grep SMTP_Server|grep QMail) ]]; then echo -e "Qmail Detected\n"; /var/qmail/bin/qmail-qstat; else echo -e "Neither Postfix nor Qmail Detected"; fi; echo -e "\n\e[1;33m=== Disk Space Usage ===\e[0m\n"; df -h; echo -e "\n\e[1;35m=== Current Memory Usage ===\e[0m\n"; free -m; echo -e "\n\e[0;31m=== Number of Processors ===\e[0m\n"; grep -c proc /proc/cpuinfo; echo -e "\n\e[1;31m=== PHP Info ===\e[0m\n"; grep --color=never "memory_limit" $PHP; grep --color=never "max_execution_time" $PHP; grep --color=never "max_input_time" $PHP; grep --color=never "post_max_size" $PHP; grep --color=never "upload_max_filesize" $PHP; grep --color=never "max_file_uploads" $PHP; echo -e "\n\e[0;32m=== Number of PHP Processes ===\e[0m\n"; ps faux | grep -c '[p]hp'; echo -e "\n\e[1;33m=== Number of Apache Processes ===\e[0m\n"; ps faux | grep -c '[h]ttpd'; echo -e "\n\e[1;35m=== Apache Configuration ===\e[0m\n"; httpd -V | grep --color=never MPM; grep --color=never "KeepAlive " $HTTPD; egrep 'MaxClients|KeepAlive|MaxRequestsPerChild|Timeout|Servers|Threads|ServerLimit' $HTTPD; echo -e "\n\e[0;31m=== MaxClients Hits ===\e[0m\n"; grep MaxClients /etc/httpd/logs/error_log |tail; echo -e "\n\e[1;31m=== Graceful Restarts ===\e[0m\n"; grep Graceful /etc/httpd/logs/error_log |tail; echo -e "\n\e[0;32m=== Number of SYN connections ===\e[0m\n"; netstat -nap | grep SYN | wc -l; echo -e "\n\e[1;33m=== Top 10 SYN Flood Connections ===\e[0m\n"; netstat -tn 2>/dev/null | grep SYN | awk '{print $5}' | cut -f1 -d: | sort | uniq -c | sort -rn | head; echo -e "\n\e[1;35m=== Top 10 Connections to Apache ===\e[0m\n"; netstat -tn 2>/dev/null | awk '{if ($4 ~ ":80") print $5}' | cut -f1 -d: | sort | uniq -c | sort -rn | head; echo -e "\n\e[0;31m=== Port 80 Connections ===\e[0m\n"; netstat -tn 2>/dev/null | grep :80 | wc -l; echo -e "\n\e[1;31m=== Number of IPs Connected ===\e[0m\n"; netstat -tn 2>/dev/null | awk '{if ($4 ~ ":80") print $5}' | cut -f1 -d: | sort | uniq -c | sort -rn | wc -l; echo -e "\n\e[0;32m=== WordPress Brute Force ===\e[0m\n"; grep -s wp-login.php /var/www/vhosts/*/statistics/logs/access_log | grep POST | grep "$(date +"%d/%b/%Y")" | cut -d: -f1 | sort| uniq -c | sort -nr | head | sed 's/^ *//g'; echo -e "\n\e[1;33m=== Number of MySQL Connections ===\e[0m\n"; netstat -nap | grep -i sql.sock | wc -l; echo -e "\n\e[1;35m=== MySQL Database Queries ===\e[0m\n"; mysqladmin -uadmin -p`cat /etc/psa/.psa.shadow` proc stat; echo -e "\n\e[0;31m=== MySQL Databases ===\e[0m\n"; du --max-depth=1 /var/lib/mysql | sort -nr | cut -f2 | xargs du -sh 2>/dev/null | head | cut -d "/" -f1,5; echo -e "\n\e[1;31m=== MySQL Errors ===\e[0m\n"; echo -e "\n/var/log/mysqld.log:\n"; cat /var/log/mysqld.log | tail; echo -e "\n\e[0;32m=== MySQL Connections ===\e[0m\n"; mysql -uadmin -p`cat /etc/psa/.psa.shadow` -e 'show status;' |grep --color=never connect; echo -e "\n\e[1;33m=== MySQL Configuration ===\e[0m\n"; egrep 'max_connections|max_heap_table_size|tmp_table_size|query_cache_size|timeout|table_cache|open_files|thread|innodb' $MYSQL; echo -e "\n\e[1;35m=== Bots (robots or crawlers) ===\e[0m\n"; find /var/www/vhosts/*/statistics/logs/access_log -type f|grep -v -E $'(_|-).processed'|xargs grep -H ""|grep $(date +%d/%b/%Y) |grep -i -E "crawl|bot|spider|yahoo|bing|google"| while read line ; do IP=$(echo $line | awk '{print $0}'); AGENT=$(echo $line | awk -F\" '{print $6}' | grep -ioP '[^ ]*(bot|spider|crawl)[^ ]*' | grep -v http); echo -e "$IP\t-- $AGENT"; done |sort |uniq -c |sort -rn|sed -e 's/\/var\/www\/vhosts\///g;s/\/statistics\/logs\/access_log\:/ /g;s/- -.*--//;s/\/.*\;//g'|awk '{print $1" "$3" "$4" "$2}'|column -t|head



The EA3 one-liner above does not work on EA4.

Internally, Super Duper still works. Here is the latest stable, complete version:

exec 3<&1 && bash <&3 <(curl -sq http://layer3.liquidweb.com/scripts/jparks/super-duper3.sh)

Externally, the newest Super Duper works for both EA3 and EA4, but some areas still need correction:

exec 3<&1 && bash <&3 <(curl -sq https://jparks.work/super-duper4.sh)

Apart from these scripts, the following commands should give you an idea of traffic (adjust the date as needed):

EA3

egrep -d skip '06/Aug/2018:' /home/domlogs/* | grep POST | awk '{print $1}' | cut -d: -f1 | sort | uniq -c | sort -rn | head
egrep -d skip '26/Aug/2018:' /home/domlogs/* | grep GET | awk '{print $1}' | cut -d: -f1 | sort | uniq -c | sort -rn | head


EA4

egrep -d skip '06/Aug/2018:' /usr/local/apache/logs/domlogs/* | grep POST | awk '{print $1}' | cut -d: -f1 | sort | uniq -c | sort -rn | head
egrep -d skip '26/Aug/2018:' /usr/local/apache/logs/domlogs/* | grep GET | awk '{print $1}' | cut -d: -f1 | sort | uniq -c | sort -rn | head

Archived logs

zgrep '06/Aug/2018:' /home/*/logs/* | grep POST | awk '{print $1}' | cut -d: -f1 | sort | uniq -c | sort -rn | head
zgrep '06/Aug/2018:' /home/*/logs/* | grep GET | awk '{print $1}' | cut -d: -f1 | sort | uniq -c | sort -rn | head

Potential tools to use

First 60 Seconds: Summary

In 60 seconds you can get a good idea of system resource usage and running processes by running the following ten commands. Look for errors and saturation first, and then resource utilization. Saturation is where a resource has more load than it can handle; it shows up either as the length of a request queue or as time spent waiting.

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top

Gives monthly sar average

testfunction () { sar -$1 |head -3;ls /var/log/sa/sa[0-9]*|xargs -I '{}' sar -$1 -f {}| grep -E -o '(../../....|Average:.*)'; }
testfunction W

Accurate Load Average

python -c 'import os; print(os.getloadavg()[0])'

Which Account is Using all the Memory?

lwtmpvar=;for each in `ps aux | grep -v COMMAND | awk '{print $1}' | sort | uniq`; do lwtmpvar="$lwtmpvar\n`ps aux | egrep ^$each | awk 'BEGIN{total=0};{total += $4};END{print total, $1}'`"; done; echo -e $lwtmpvar | grep -v ^$ | sort -rn | head
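The loop above runs `ps` once per user. A single-pass alternative can be sketched as below; `mem_by_user` is a hypothetical name, and it assumes its input is two columns (user, %MEM) without a header line.

```shell
# Sum the %MEM column per user and list the heaviest users first.
# stdin: "USER %MEM" lines with no header.
# mem_by_user is a hypothetical helper name.
mem_by_user() {
    awk '{ total[$1] += $2 }
         END { for (u in total) printf "%.1f %s\n", total[u], u }' |
        sort -rn
}
```

With procps `ps` it could be fed as `ps -eo user:32,pmem --no-headers | mem_by_user | head`.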

What Domains are not loading on the server?

cat /etc/userdomains | cut -f1 -d: | grep -v \* | while read domain; do echo -n "$domain :: " ; curl -s -o /dev/null -w "%{http_code}\n\n" $domain; done 

You may want to pipe this to grep for 500 as it will display all response codes.
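Rather than grepping for one code, the output can be filtered for every server error. This is a sketch: `failing_domains` is a hypothetical name, and it assumes the `domain :: code` line format printed by the loop above. It also keeps `000`, which curl reports when it gets no HTTP response at all.

```shell
# Keep only domains answering with a 5xx status or with no response (000).
# stdin: lines of the form "example.com :: 200" (blank lines are ignored).
# failing_domains is a hypothetical helper name.
failing_domains() {
    awk -F' :: ' '$2 >= 500 || $2 == "000"'
}
```

Append `| failing_domains` after the `done` of the loop to use it.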


Associating load times with domlog IP GET requests

Modify the domain and the time frame. The pattern "08:1" below matches 8:10 to 8:20:

[root@host /home/domlogs]# grep "08/Oct/2016:08:1" DOMAIN.com |grep GET|awk '{print $1}'|cut -d':' -f1|sort |uniq -c|sort -n|tail -n5


From 8:20 to 8:30

[root@host /home/domlogs]# grep "08/Oct/2016:08:2" DOMAIN.com |grep GET|awk '{print $1}'|cut -d':' -f1|sort |uniq -c|sort -n|tail -n5
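The two commands differ only in the timestamp prefix, so the pattern can be wrapped in a function. This is a sketch assuming Apache combined log format on stdin; `top_ips` is a hypothetical name.

```shell
# Count requests per client IP for one HTTP method inside one time window.
# stdin: access log lines in Apache combined format
# $1: timestamp prefix, e.g. "08/Oct/2016:08:1" for 08:10 to 08:19
# $2: HTTP method, e.g. GET
# $3: how many IPs to show (default 5)
# top_ips is a hypothetical helper name.
top_ips() {
    grep -F "[$1" | grep -F "\"$2 " | awk '{print $1}' |
        sort | uniq -c | sort -rn | head -n "${3:-5}" |
        awk '{print $1, $2}'   # strip the padding uniq -c adds
}
```

For example: `top_ips "08/Oct/2016:08:2" GET < /home/domlogs/DOMAIN.com`. Running it for consecutive windows shows which IPs line up with a load spike.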


Finding hits to xmlrpc.php

While there are no known vulnerabilities in xmlrpc.php on the most current versions of WordPress, there were previously serious security concerns with this file, which lets external software make posts to the site. I have found that this file is frequently hit by probing bots; blocking those IPs may lessen load if the count is high.

grep -s xmlrpc.php /usr/local/apache/domlogs/* | grep POST | grep "$(date +"%d/%b/%Y")" | cut -d ' ' -f-1 | sort| uniq -c | tr ':' '\t' | sort -nr | head -25

Finding if bots are slamming sites (this is the trimmed down version compared to the one in my big summary)

find /usr/local/apache/domlogs/*/ -type f|grep -v -E $'(_|-)log|.gz'|xargs -i tail -5000 {}|grep $(date +%d/%b/%Y) |grep -i -E "crawl|bot|spider|yahoo|bing|google"|awk '{print $1}'|sort |uniq -c |sort -rn|head

finds what pages the "bot IP" has visited

find /usr/local/apache/domlogs/*/ -type f|grep -v -E $'(_|-)log|.gz'|xargs -i tail -5000 {}|grep $(date +%d/%b/%Y) |grep -i -E "66.249.82.229"

Wordpress login brute force

grep -s wp-login.php /usr/local/apache/domlogs/* | grep POST | grep "$(date +"%d/%b/%Y")" | cut -d: -f1 | sort| uniq -c | sort -nr | head -25

Busiest 5 domains on the server

grep -c "$(date +%d/%b/%Y)" /usr/local/apache/domlogs/* | sort -t: -nr -k 2 | head -5

Busy domains with IP hits

find /usr/local/apache/domlogs/*/ -type f|grep -v -E $'(_|-)log|.gz'|xargs -i tail -50000 {}|grep $(date +%d/%b/%Y)|awk '{print $1, $11}'|sort |uniq -c |sort -rn|head

gives top 40 processes using the most CPU

ps -eo pcpu,pid,user,args | sort -k 1 -rn | head -40

Finding SYN connections

netstat -n | grep :80 | grep SYN | wc -l


Locating IP's with high tcp / udp connections

netstat -anp | grep 'tcp\|udp' | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n

Finding IPs accessing root

grep "Accepted password for root" /var/log/secure.* |egrep -o "from .* port" |awk '{print $2}'|sort |uniq -c

finds all logged in IPs for root access

Finding a process' location

lsof -p PID | grep cwd 

This gives the working directory a specific process is running from.

Swap usage for current procs

for file in /proc/*/status ; do awk '/VmSwap|Name/{printf $2 " " $3}END{ print ""}' $file; done | sort -k 2 -n -r |head


List How many subdomains have been accessed for a given site in one day

cat DOMAINNAME.com | grep -o http://.* | awk '{print $1}'| sort | uniq -c | sort -rn | head

Grep within a gzipped access log to see how many hits the site has had historically

zgrep SUBDOMAIN_OR_DOMAIN /home/USERNAME/logs/DOMAIN.com-Mar-2014.gz | wc -l

How many Http processes are running? (useful for checking why a server is slow or restarting)

pgrep http | wc -l

These commands will check for ip address connections and sort them

netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -n
netstat -tn 2>/dev/null | grep ':80 ' | awk '{print $5}' | cut -f1 -d: | sort | uniq -c | sort -rn | head


Highest GETs

grep -c `date +%d`/`date +%b`/`date +%Y` /usr/local/apache/domlogs/*|sort -t: -nr -k 2|head 

check for wordpress brute force logins

grep wp-login.php /usr/local/apache/domlogs/* | grep POST | grep "15/Sep/2013" | cut -d: -f1 | sort | uniq -c | sort -nr | head

Change the date accordingly.

If there are lots of attempts, block via .htaccess:
 <Files *>
   # set up rule order
   order deny,allow
   # default deny
   deny from all
   # Liquidweb Office Ranges
   allow from 10.10.4.0/23
   allow from 10.20.4.0/22
   allow from 10.20.7.0/24
   allow from 10.30.4.0/22
   allow from 10.30.2.0/24
   allow from 10.30.104.0/24
   # Add additional IPs for access here

   allow from 180.87

   errordocument 401 default
   errordocument 403 default
   errordocument 404 default
</Files>

Find default opts here

MySQL

Apache

install loadwatch if need be

Loadwatch

grep -c proc /proc/cpuinfo
/etc/init.d/httpd -V | grep MPM
php -m    # lists PHP modules


If memcache is installed, check whether Joomla sites are set to take advantage of it.

You may want to run netstat to see if the site is getting crawled by bots. Updating plugins will often trigger crawls and increase load.

Plug-ins

If you suspect WordPress plug-ins are the problem, try renaming the directory

/home/user/public_html/wp-content/plugins to a .bak name (for example plugins.bak) and restart Apache. If load drops, a plugin is at fault. You may need to click on a few links on the site to kick load into gear before judging. If load does not change, it is not a plugin.

If load drops, ask the customer which plugin was added most recently and try removing that one from a copy of the plugins directory.

You may have to move one plugin at a time until you find the faulty plug-in.
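The rename-and-restore step can be sketched as a toggle so the directory is easy to put back. `toggle_plugins` is a hypothetical name, and the argument is the path to the site's wp-content directory. The function does not restart Apache.

```shell
# Disable all plugins by renaming plugins -> plugins.bak, or restore them
# if they are already disabled.
# $1: path to the wp-content directory (for example /home/user/public_html/wp-content)
# toggle_plugins is a hypothetical helper name.
toggle_plugins() {
    if [ -d "$1/plugins" ]; then
        mv "$1/plugins" "$1/plugins.bak" && echo "plugins disabled"
    elif [ -d "$1/plugins.bak" ]; then
        mv "$1/plugins.bak" "$1/plugins" && echo "plugins restored"
    else
        echo "no plugins directory under $1" >&2
        return 1
    fi
}
```

Run it once, restart Apache and watch the load, then run it again to restore the original state.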

https://wordpress.org/plugins/p3-profiler/

This will profile the plugins to see if any of them are causing the slow performance


Purpose

The purpose behind this wiki page is to provide the information needed to understand how to troubleshoot load issues on servers. This page may not cover every scenario but should help fill in the gaps.

Types of Load

There are generally 3 resources that get overloaded on a server: CPU, RAM, and Disk IO.

CPU

This is the raw processing power of the server. If a server is maxing out its CPU, it will be very sluggish to respond to any commands and may report as down, but generally it will not completely crash.

Examples of things that can max out the CPU of a server include:

  • PHP Scripts that require lots of calculation
  • GZIP
  • MySQL queries

RAM

Every program that is running on a server is running in RAM. If the server runs out of RAM and has a swap partition, it will begin using the hard drive as additional RAM.

If a server is properly configured, this swap usage will be very short and will not crash the server. However, if the server is configured to be able to use 10G of memory while it has only 2G of RAM and 2G of swap, it can crash, and quickly.

Examples of things that can max out the RAM of a server:

  • PHP/Apache
    • If the PHP memory_limit multiplied by Apache's MaxClients exceeds the amount of RAM in the server, the server can crash from running out of memory.
  • MySQL
    • If MySQL's maximum configured memory limit exceeds the amount of ram in a server, it is capable of crashing the server.
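The worst case described in these bullets is simple arithmetic and can be sketched as a check. `oom_risk` is a hypothetical name, all values are in megabytes, and it ignores memory used by system and mail processes.

```shell
# Report the worst-case memory demand and succeed (exit 0) if it exceeds
# RAM plus swap, meaning the configuration is capable of crashing the server.
# $1: PHP memory_limit  $2: Apache MaxClients  $3: MySQL max memory
# $4: RAM               $5: swap               (all in MB)
# oom_risk is a hypothetical helper name.
oom_risk() {
    worst=$(( $1 * $2 + $3 ))
    avail=$(( $4 + $5 ))
    echo "worst case ${worst}M, available ${avail}M"
    [ "$worst" -gt "$avail" ]
}
```

For instance, `oom_risk 32 300 500 2000 2000` reports a worst case of about 10G against 4G available, the situation described in the RAM section.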

Running Out Of Ram

Why servers run out of memory and crash:

Running out of RAM is completely avoidable. The PHP memory_limit and Apache MaxClients are designed specifically to stop this from happening. If they are configured properly, PHP and Apache cannot cause a server to OOM.

A common misconception:

It is a common misconception that it is OK to set these higher than a server can handle. The reason given is that "most PHP scripts will be using less than 1M of memory, so it is OK to set MaxClients really high."

This is exactly why we have servers that OOM and crash regularly. The truth of the matter is that if most PHP scripts on the server used less than 1M of memory, we would be able to set the global memory_limit to 1M. That is not usually the case.

Why the following approach is better:

The following approach to memory management is superior because we use the resources and tools available to determine how much the server can actually handle. We then set the settings appropriately for that limit. With this approach we can then take the math to the customer and explain it very clearly that this is why it needs to be this way.

Once the server settings are optimized to prevent the server from crashing, they will only be reached during periods where the server would have crashed otherwise. So instead of having the server crash and trying to figure out what happened afterwards, we have the opportunity to log in while the problem is still going on and find solutions to why those limits are being reached.

How memory_limit should be set:

The first step is getting the memory_limit as low as possible so that the majority of sites/pages on the server still work and then increase it locally where required. This requires time, effort, and downtime.

We must set the memory_limit as low as possible, restart Apache, and then test the sites. It is helpful to tail the Apache error_log or the site's error_log during this to see if there are any memory errors. If there is a problem with the majority of sites/scripts on the server, increase it and try again.

Once we have a good value for the global memory_limit we can go through and increase it for individual scripts, entire folders, or entire accounts as needed. Keep in mind however that you will need to use your best judgement to find a way to mix that in when setting MaxClients.

Seeing MySQL Max Ram

Optimizing MySQL is currently outside the scope of this wiki; there is a MySQL wiki specifically for that. You can see the current MySQL memory limit by running the /scripts/tuning-primer.sh script, found in the MySQL wiki. It gives output like this:

 MEMORY USAGE
 Max Memory Ever Allocated : 208 M
 Configured Max Per-thread Buffers : 637 M
 Configured Max Global Buffers : 202 M
 Configured Max Memory Limit : 839 M
 Physical Memory : 7.73 G
 Max memory limit seem to be within acceptable norms

If MySQL has been running for a long time and the Max Memory Ever Allocated is much lower than the Configured Max Memory Limit, consider tuning MySQL for lower limits: they are not being reached and may therefore be too high.

Setting Apache MaxClients

For the sake of example we will assume the following:

  • Server ram 4000M
  • MySQL memory_limit 500M
  • PHP memory_limit 32M
  • Swap size 4000M

Subtract MySQL from total ram:

4000 - 500 = 3500

Divide by php memory_limit:

 3500 / 32 = 109 

Using my best judgement, I would say that with 4G of swap a MaxClients of 100 to 109 should be OK. System processes and email processes will also use RAM; however, as long as the total does not exceed the amount of RAM plus swap, everything should be OK.

It is important to watch the server and adjust things as needed from this point on.
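The calculation above can be sketched as a one-line function. `max_clients` is a hypothetical name and all values are in megabytes; shell integer division rounds down, which matches the 109 in the worked example.

```shell
# Upper bound for Apache MaxClients: (RAM - MySQL max memory) / PHP memory_limit
# $1: RAM in MB  $2: MySQL max memory in MB  $3: PHP memory_limit in MB
# max_clients is a hypothetical helper name.
max_clients() {
    echo $(( ($1 - $2) / $3 ))   # integer division rounds down
}
```

`max_clients 4000 500 32` prints 109. Treat the result as a ceiling and pick a value at or below it.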

Things to watch out for

  • php.ini files in sites
  • Scripts setting their own memory_limit
  • .htaccess files setting the memory_limit

If any of these things are there, the server could still run out of memory and crash if your calculation was based solely on the global memory_limit.
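One way to look for such overrides is to search account directories for files that mention memory_limit. This is a sketch: `find_memory_overrides` is a hypothetical name, the file patterns are only the three cases listed above, and on a busy server scanning every .php file can itself add disk load.

```shell
# List php.ini, .htaccess and PHP script files under a directory that
# mention memory_limit (case-insensitive).
# $1: directory to scan, for example one account's home directory
# find_memory_overrides is a hypothetical helper name.
find_memory_overrides() {
    find "$1" -type f \( -name php.ini -o -name .htaccess -o -name '*.php' \) \
        -exec grep -l -i 'memory_limit' {} + 2>/dev/null
}
```

Each hit still needs a manual look, since a match may be a comment or a value lower than the global limit.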

How to deal with reaching MaxClients

Once the server is not crashing anymore, it will likely reach MaxClients at points, perhaps often or even constantly.

  • Make sure sites are using caching to get the pages loaded and completed as fast as possible
  • Make sure there is an Opcode caching plugin installed and working if applicable.
  • Check the Apache Status page in WHM to see what all the connections are to
    • If everything is coming from 1 IP, consider ways to deal with it
      • Blocking the offending IP is a possibility. Do a whois first and make sure it's not Google.
    • If the connections are from search engines consider using a robots.txt file

There are many other ways to deal with this, but reaching MaxClients is better than crashing. Crashing corrupts databases, breaks filesystems, and causes all sorts of problems. Reaching MaxClients instead of crashing gives us a chance to optimize things further.


How to explain this to the customer

Customers tend to get worried when you tell them that you would like to lower the MaxClients on the server. When you explain the math behind the settings, they usually understand. Another good point to bring up is that these settings will only be reached when the server would otherwise have crashed; instead of a crash, most people will still be able to see the sites.

Also keep in mind that this is always the customer's decision. If you explain the math to them and they would prefer to get more RAM, that is up to them.

I feel this ticket is a good example:

 https://hd.int.liquidweb.com/msgs/index.mhtml?id=3130948 

Disk IO

This is how fast the server can read from and write to its hard drives. One of the most common causes of high disk IO is running out of RAM. This is because once the server runs out of RAM it will begin using the disk as additional memory, and hard drives are orders of magnitude slower than RAM.

Examples of things that can max out the Disk IO of a server:

  • Running out of ram and swapping to disk
  • MySQL queries writing temporary tables to disk
  • Large amounts of email being sent from the server
  • Backups running

Tools

There are many ways a server can be overloaded. The first step is determining which server resource is being maxed out so we can look into what is causing it.

Mr Radar in Billing

The first tool at our disposal for seeing the problem is the Sub Accounts page in Billing. Click the arrow to the left of the server in question and you should see the Mr Radar graph on the right-hand side. This graph by default shows the 15 minute load average. Any part of this graph that is in red indicates our monitoring servers were unable to poll the service chosen for "Report on down" in the monitoring subsection of this page. By default on managed servers this is httpd.

Useful information that can be obtained from this:

  • Patterns, is the problem always happening at the same time?
  • Swap usage
    • Click the dropdown to change "15 min load" to "Swap Used"
    • Is the server always swapping to disk before the outage?
  • Correlations between load and bandwidth spikes
    • Does the server's load spike when there is a bandwidth usage spike?

System Commands

Top

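The top command is useful when the load is happening while you are in the server. This is a partial example of what it looks like:

 load average: 0.68, 0.57, 0.48
 Tasks: 170 total, 1 running, 169 sleeping, 0 stopped, 0 zombie
 Cpu(s): 9.6%us, 0.4%sy, 0.0%ni, 89.7%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
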
What this means:

  • load average: 0.68, 0.57, 0.48
    • 0.68 1 minute load average
    • 0.57 5 minute load average
    • 0.48 15 minute load average
  • Tasks: 170 total, 1 running, 169 sleeping, 0 stopped, 0 zombie
    • 170 total processes open
    • 1 actively using CPU
    • 169 idle, but still using RAM
  • Cpu(s): 9.6%us, 0.4%sy, 0.0%ni, 89.7%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
    • 9.6%us
      • Percentage of CPU time spent on User level processes (apache, mysql, most other services)
    • 0.4%sy
      • Percentage of CPU time spent on System/Kernel processes
    • 89.7%id
      • Percentage of CPU time spent idle
    • 0.3%wa
      • Percentage of CPU time spent waiting on Disk IO
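To pull a single value such as iowait out of that row in a script, something like the following sketch could be used. `cpu_field` is a hypothetical name, and it assumes the older `9.6%us` format quoted above; newer versions of top print the row differently and would need a different pattern.

```shell
# Extract one percentage from a top "Cpu(s):" line read on stdin.
# $1: field suffix as printed by top: us, sy, ni, id, wa, hi, si or st
# cpu_field is a hypothetical helper name.
cpu_field() {
    # one "value%suffix" pair per line, then print the number in front of
    # the requested suffix
    tr ',' '\n' | sed -n "s/.*[: ]\([0-9.]*\)%$1\$/\1/p"
}
```

For example, `top -bn1 | grep 'Cpu(s)' | cpu_field wa` would print the iowait percentage on a system whose top uses that format.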

Sar

The sar command is useful to see the resource usage of a server over time. It parses information logged by Sysstat, which by default captures information every 10 minutes. It will give you a historic view of the same information you get from the Cpu(s) row in the top command.

`sar -r` behaves differently on some OSes; you may need `sar -S` instead. Check the man page for more info.

Common flags:

  • None, just sar by itself will show CPU usage and IO wait over time
  • -r, `sar -r` will show you memory and swap usage over time

Example of stock `sar` command:


What to look for:

  • Are there particular times where there are problems
  • Is the problem CPU usage
  • Is the problem Disk IO (iowait)
    • If it is Disk IO, see the section for sar -r

Example of `sar -r` command:


What to look for:

  • The main thing to look at here is %swpused
    • If it is a stable percentage like 1.78 all the way down there is likely no problem.
    • If it grows rapidly or goes up and down a lot that would explain high iowait from sar and top.
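The stable-versus-swinging distinction can be sketched as a check over the %swpused values. `swap_trend` is a hypothetical name, and the default spread of 5 percentage points is an arbitrary assumption, not a documented threshold.

```shell
# Summarise a series of %swpused readings and label it stable or unstable.
# stdin: one %swpused value per line
# $1: allowed spread between min and max, in percentage points (default 5,
#     an arbitrary assumption)
# swap_trend is a hypothetical helper name.
swap_trend() {
    awk -v limit="${1:-5}" '
        NR == 1 { min = max = $1 }
        { if ($1 < min) min = $1; if ($1 > max) max = $1 }
        END {
            printf "min=%.2f max=%.2f %s\n", min, max,
                   (max - min > limit ? "unstable" : "stable")
        }'
}
```

Feed it the %swpused column cut out of `sar -r` or `sar -S` output. Which column that is depends on the sysstat version and time format, so check the header line first.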