Killing Off Runaway Apache Processes

Well, something is still going wrong on dltj.org: despite previous performance tuning efforts, I'm still running into cases where machine performance grinds to a halt. Debugging it a bit further, I've found that the root cause is an apache httpd process that tries to consume nearly all of real memory, which then causes the rest of the machine to thrash horribly. The problem is that I haven't figured out what is causing that one process to want so much RAM. Nothing unusual appears in either the access or the error logs, and I haven't found a way to debug a running apache process. (Suggestions, anyone?)
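One way to at least spot which apache2 child is ballooning while the machine slows down is to read each process's memory figures out of /proc. The sketch below is illustrative, not something from the original setup: it assumes a Linux /proc whose /proc/PID/status files carry `Name:` and `VmRSS:` lines (as current kernels do), and the function name `largest_by_rss` and its optional second argument (an alternate proc root, there only so it can be exercised against fake data) are hypothetical. It prints resident memory in kB and the PID, largest first.

```shell
#!/bin/bash
# Hypothetical helper: list processes named $1 (default apache2) by resident
# memory, largest first, as "RSS_kB PID" lines.
# $2 is the proc root; it defaults to /proc and exists only for testing.
largest_by_rss() {
    local name=${1:-apache2} root=${2:-/proc} status dir
    for status in "$root"/[0-9]*/status; do
        # Skip unmatched globs and processes that exited mid-scan.
        [[ -r $status ]] || continue
        dir=${status%/status}
        # Name: may contain spaces, so strip the "Name:" label instead of using $2.
        # Kernel threads have no VmRSS: line and are skipped.
        awk -v want="$name" -v pid="${dir##*/}" '
            /^Name:/  { n = $0; sub(/^Name:[ \t]*/, "", n) }
            /^VmRSS:/ { rss = $2 }
            END { if (n == want && rss != "") print rss, pid }
        ' "$status" 2>/dev/null
    done | sort -rn
}
```

Running `largest_by_rss` a few times a minute apart shows which PID keeps growing; that PID is the one worth inspecting further.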

Found it! A WordPress plug-in, combined with a change to the PHP configuration, was causing the problem. The fix for the underlying cause came from a comment timestamped February 8th, 2007 at 3:55 pm on the Footnotes 0.9 Plugin for WordPress 2.0.x page. An infinite loop in that plug-in was consuming both CPU cycles and RAM, and this was exacerbated by a change I had made to PHP's maximum CPU execution time for scripts, which was required in order to play with the IP City Cluster plug-in: with a longer limit, PHP let the looping script run (and grow) for longer before aborting it. With the patch to the Footnotes plug-in applied, dltj.org has gone 12 hours without a runaway apache process.

In any case, I whipped up this little ditty, which runs every five minutes from cron as a way to gloss over the problem for the moment. Running as root, it looks at every process in the virtual /proc file system, specifically at each process's 'stat' file, and uses awk to check whether the second space-delimited value (the command name, wrapped in parentheses) is the name of the httpd process (this is the Gentoo Linux distribution, so the process is named apache2) and whether the 23rd space-delimited value (the virtual size of the process, in bytes) is bigger than 800MB. If so, awk prints the PID of the process (the first value in the stat file), at which point the bash script unceremoniously sends it a kill signal ('-9', i.e. SIGKILL). The script looks like this:

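#!/bin/bash
# Kill any apache2 process whose virtual size exceeds about 800MB.
# In /proc/PID/stat, $1 is the PID, $2 the command name in parentheses,
# and $23 the virtual memory size in bytes.
for i in /proc/[0-9]*; do
        if [ -f "$i/stat" ]; then
                # The process may exit between the test and the read; discard awk's error if so.
                pid=$(/bin/awk '{ if ($2 == "(apache2)" && $23 > 800000000) print $1 }' "$i/stat" 2>/dev/null)
                if [ -n "$pid" ]; then
                        echo "Killing $pid (load average: $(awk '{print $1}' /proc/loadavg))"
                        kill -9 "$pid"
                fi
        fi
done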
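One fragility in the awk test: the command name in field 2 can itself contain spaces (and even ')'), which shifts awk's field numbers for such processes. For apache2 this hardly matters in practice, but anyone adapting the script to another process name may want a sturdier parse. The following is a sketch, not the script actually running on dltj.org. It assumes the /proc/PID/stat layout described in proc(5), where the name is wrapped in parentheses and everything after the last ')' is the remaining fields. The function names `stat_over_limit` and `scan_and_kill` and the `KILL` variable are hypothetical, and by default it only reports what it would kill.

```shell
#!/bin/bash
# Hypothetical helper: print the PID recorded in a /proc/PID/stat-format file
# if its command name equals $2 and its virtual size (field 23, bytes) exceeds $3.
stat_over_limit() {
    local statfile=$1 name=$2 limit=$3 line pid comm rest vsize
    local -a f
    # The process may already have exited; treat an unreadable file as no match.
    { read -r line < "$statfile"; } 2>/dev/null || return 1
    pid=${line%% *}          # field 1: PID
    comm=${line#*\(}         # drop everything up to the first "("
    comm=${comm%\)*}         # drop everything from the last ")"
    rest=${line##*\) }       # fields 3 onward (state, ppid, ...)
    read -r -a f <<< "$rest"
    vsize=${f[20]}           # field 23 of the line is field 21 of $rest
    [[ $vsize =~ ^[0-9]+$ ]] || return 1
    if [[ $comm == "$name" ]] && (( vsize > limit )); then
        echo "$pid"
    fi
}

# Hypothetical driver: report matches (default) or, with KILL=1, SIGKILL them.
scan_and_kill() {
    local name=${1:-apache2} limit=${2:-800000000} dir pid
    for dir in /proc/[0-9]*; do
        pid=$(stat_over_limit "$dir/stat" "$name" "$limit") || continue
        [[ -n $pid ]] || continue
        if [[ ${KILL:-0} == 1 ]]; then
            echo "Killing $pid"
            kill -9 "$pid"
        else
            echo "Would kill $pid"
        fi
    done
}
```

For example, `scan_and_kill apache2 800000000` only lists matches, while `KILL=1 scan_and_kill` (run as root) kills them, mirroring the cron job above.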

If anyone has suggestions on how to narrow down what the problem might be, I'd appreciate hearing from you. I've tried eliminating WordPress plug-ins, recompiling WordPress and Apache, and catching the behavior with a network traffic sniffer, but have come up empty so far.

The text was modified to update a link from http://blog.vimagic.de/ip-city-cluster-wordpress-plugin/ to http://wordpress.org/plugins/ipccp/ on August 22nd, 2013.