HANDY SOLUTIONS FOR WHEN ALL HELL BREAKS LOOSE
By Jeffrey Carl
It’s a sad fact that most system administration learning is done in the minutes and hours after you say the words, “Wow. I’ve never seen something get broken that way before.” Learning to be a sysadmin means that you discover how to fix all the problems that pop up, until you find a problem you’ve never run into before. Then you scramble to learn how to fix that, and you’re fine until the next new Unidentified Weird Thing™ happens. And so on.
Fortunately, about 90 percent of Unix web/mail/etc. server problems can be discovered or fixed with just a few tools – much like 90 percent of all household repairs can be done with a screwdriver, a wrench or a baseball bat. Knowing just a few likely trouble spots and troubleshooting tools can help you resolve a lot of that unidentified weirdness without getting so frustrated that you want to rip the hard drives out of the computer and make refrigerator magnets out of them.
The key here is that unlike cars or girlfriends, everything that goes wrong on a Unix system happens for a clearly defined reason. While that reason may sometimes be freakish or undocumented, it’s almost always one of a few fairly common issues.
So, with that in mind, we’re going to take a look at the Handy Tools and the Usual Suspects – the top commands and tools to use, and common places to look that will at least shed some light on most server problems. I’m going to use FreeBSD as the example system – but most other BSDs and Linux can use the same tools, even if they act slightly differently or are located in a different place in the filesystem.
• When a server is responding slowly, you need to figure out whether the problem is on the server or in the network. After you’ve logged in to the server and become root, your first stop should be uptime.
The important part of the information it provides is the server’s load averages – shown for the last one, five and 15 minutes. If the load average is high (as a rough rule of thumb, above two or three on a single-CPU machine; scale that figure up on multiprocessor systems), the most likely cause of the slowness is one or more “runaway” processes or some other processes making heavy use of the system. If the load average isn’t high, then you’re probably looking at a networking issue that is slowing access to the server.
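To make the load-average check concrete, here is a minimal shell sketch that pulls the three figures out of uptime’s output and compares the one-minute value to a threshold. The function names load_averages and load_status are my own inventions for illustration, and the threshold is whatever you consider high for your hardware.

```shell
# Extract the 1-, 5- and 15-minute load averages from uptime-style output.
# Reads stdin, so it works on saved output as well as on a live system.
# Handles both "load averages:" (BSD) and "load average:" (Linux).
load_averages() {
    sed -n 's/.*load averages\{0,1\}: *//p' | tr -d ','
}

# Print "high" when the 1-minute load average exceeds the given threshold,
# and "ok" otherwise. The threshold is an assumption you supply.
load_status() {
    threshold=$1
    load_averages | awk -v t="$threshold" '{ print (($1 + 0 > t + 0) ? "high" : "ok") }'
}
```

A typical use would be uptime | load_status 3, which prints either “high” or “ok”.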
• If you’ve found that a high load average is the likely culprit, turn to top. The top command lists the server’s processes, sorted by CPU usage by default, along with each one’s memory utilization.
By default, top shows the top 10 processes, or you can use it in the form top N, where N is the number of processes you wish it to show. If you have one or more “runaway” processes (for example, a tcsh process left over from an improperly terminated login session), you can quickly identify it and issue a kill or kill -9 (which effectively means “I don’t care what you think you’re doing, just shut up and go away”) command to the process ID number (PID) of the runaway.
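Since kill -9 gives a process no chance to clean up after itself, one reasonable habit is to try a plain kill first and escalate only if the process ignores it. The sketch below is one way to do that; stop_pid is a hypothetical helper name, and the five-second default wait is an arbitrary assumption of mine.

```shell
# Ask a process to exit politely, and force it only if it ignores the request.
# Usage: stop_pid PID [seconds_to_wait]
stop_pid() {
    pid=$1
    wait_secs=${2:-5}          # assumed default; adjust to taste

    # Plain kill (SIGTERM). Fails if the process is gone or isn't ours.
    kill -TERM "$pid" 2>/dev/null || return 1

    # "kill -0" sends no signal; it only tests whether the PID still exists.
    while [ "$wait_secs" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        wait_secs=$((wait_secs - 1))
    done

    # Still there: fall back to SIGKILL, the equivalent of kill -9.
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"
    fi
}
```

For example, stop_pid 12345 10 would give PID 12345 up to ten seconds to exit before forcing the issue.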
• For a more complete listing of the processes that are running on your computer, use ps. The ps -auxw command (on BSD-based systems; ps -ef on System V-based systems; the ps on most Linuxes will accept either) will show all system processes owned by all users, whether active or background.
You can use this to find any active process and get its PID if you need to “re-HUP” or kill it. You can find the processes for a single service by using ps in combination with the venerable grep, such as finding all Apache processes with ps -auxw | grep httpd | grep -v grep. Compare the number of active web processes to the server’s “hard” and “soft” limits (the “hard” limit is set when Apache is compiled; the “soft” limit is set in the [apache_dir]/conf/httpd.conf file for recent versions). If those numbers are close to being equal, consider either upgrading your hardware or reconfiguring/recompiling Apache with higher limits.
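If you find yourself counting Apache processes often, a small sketch like the following can do the tallying. It assumes the usual 11-column ps -auxw layout with the command in the last column; count_httpd and httpd_headroom are names I made up, and the limit is a number you supply from your own configuration.

```shell
# Count httpd processes in "ps -auxw"-style output read from stdin.
# Matching on the command column (field 11) means the grep process
# itself is never counted, so no "grep -v grep" step is needed.
count_httpd() {
    awk 'NR > 1 && $11 ~ /(^|\/)httpd$/ { n++ } END { print n + 0 }'
}

# Report the process count next to a limit supplied by the reader.
# Usage: ps -auxw | httpd_headroom LIMIT
httpd_headroom() {
    limit=$1
    count=$(count_httpd)
    echo "$count of $limit httpd processes in use"
}
```

Running ps -auxw | httpd_headroom with your configured limit as the argument gives a quick read on how much room is left.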
• If you’re worried that a user on your system is running an unauthorized program, hacking the system or otherwise foobaring things, then w is a simple check.
The w command lists active users on the system and what they’re doing. If any of them are performing unauthorized activities, simply kill that user’s shell and use vipw to either give them a password (the second colon-separated field, immediately after the username) of “*” or assign them a shell (the last field of each user’s line) of /sbin/nologin until you have sorted out what they were doing and whether it violated your policies. A kill -9 may be necessary for “phantom” or “zombie” processes that were left running after improper logouts.
• If your problem is a crashed or non-starting Apache webserver, use the built-in apachectl command to work out the issue. It’s generally installed in the bin subdirectory of the Apache installation; if this isn’t in your shell’s command path, you may need to specify the full path to this command. Aside from the basic apachectl start and apachectl stop commands, one of the more useful options is the apachectl configtest command, which performs a basic evaluation of Apache’s httpd.conf configuration file (where almost all options are specified for Apache 1.3.4 and later).
Unfortunately, apachectl is notorious for providing “okay” readings when some configuration problems are still present (most notably when a directory specified for a virtual host is not found or not readable, which causes Apache to fail). For these situations, you’ll need to consult your Apache error logs (see below). Also, apachectl consults the file /var/run/httpd.pid to find the PID of the parent Apache process; if the PID recorded there doesn’t match the running parent, the apachectl stop command won’t work. In these cases, find the httpd process owned by root using ps (this will be the “parent” Apache process) and kill that process.
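A quick way to tell whether a stale PID file is what’s tripping up apachectl is to check whether the PID recorded in it still belongs to a live process. The sketch below does only that; pidfile_is_live is a hypothetical name, and it assumes you’re running as root, since the existence test fails for processes you aren’t allowed to signal.

```shell
# Succeed when the PID recorded in a pidfile belongs to a running process.
# Usage: pidfile_is_live /var/run/httpd.pid
pidfile_is_live() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1                 # missing or unreadable file
    pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*) return 1 ;;                  # empty or not a number
    esac
    # Signal 0 tests for existence without disturbing the process.
    kill -0 "$pid" 2>/dev/null
}
```

Note that this only shows that some process has that PID; confirming that it is really the parent httpd still takes a look at ps.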
• Your first tool for diagnosing whether a problem may be in the server’s network connection rather than on the server itself is ping. Using ping to test the connection to a server is a common test, but some problems (such as a mismatch in duplex or speed settings between a server and its switch) may not show up using ping normally. If a ping to a server appears normal but you suspect a network error is involved, try using ping with larger-than-normal packet sizes. The default size of the data packet used by ping is only 56 bytes, but many errors will only show up when large ping packets (2048 bytes or greater) are used. Use the -s flag with ping to specify a larger packet size (use the -c option to specify the number or “count” of pings to send).
With large packet sizes, a longer-than-usual round-trip time is normal, but excessively long times or packet loss are good indicators that there is a network configuration problem present. Try sending large ping packets for at least a count of 50, and compare the results with a long-count ping with normal packet sizes.
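To compare the two runs without squinting at the output, you can pull just the packet-loss figure out of ping’s summary line. This is a minimal sketch that assumes the summary contains the phrase “% packet loss”, as it does in the BSD and Linux versions I’ve seen; packet_loss is a made-up helper name, and server.example.com below is a placeholder.

```shell
# Print the packet-loss percentage from ping's summary line (read from stdin).
# Prints nothing if no summary line is found.
packet_loss() {
    sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}
```

You might then run ping -c 50 server.example.com | packet_loss and ping -c 50 -s 2048 server.example.com | packet_loss and compare the two numbers.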
• To show the status of your server’s network connections, use netstat -f inet. Netstat will show you which ports are open on your server and which services are active, as well as what foreign host is connecting to the port or service in question.
If you’re concerned that your server is being attacked across the network, this will generally show up in excessive usage of the memory that the kernel has allocated to networking. To find this out, use the -m (memory buffer, or “mbuf”) flag for netstat. If you find that normal services like httpd aren’t heavily burdened but the percentage of memory allocated to networking is still high (90 percent or more), consider shutting down network services or ports that are open and may be being attacked or misused.
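When you suspect misuse, it can also help to see which foreign hosts hold the most connections. The following sketch tallies them from netstat output; it assumes numeric output (the -n flag) with the foreign address in the fifth column and the port attached by a dot or colon, and connections_by_host is a name I’ve invented.

```shell
# Tally TCP connections per foreign host from "netstat -f inet -n" output.
# Listening sockets (foreign address containing "*") are skipped.
connections_by_host() {
    awk '$1 ~ /^tcp/ && $5 !~ /\*/ {
             host = $5
             sub(/[.:][0-9]+$/, "", host)   # strip the trailing port
             count[host]++
         }
         END { for (h in count) print count[h], h }' | sort -rn
}
```

Piping netstat -f inet -n through connections_by_host lists the busiest remote addresses first; a single host with an outsized count is worth a closer look.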
• If a network issue is the likely cause of your problem, use ifconfig (the interface configuration command) to check how the NICs (Network Interface Cards) on the server are set up.
You can ignore the lo0 (loopback) interface; what really matters are the settings for your server’s NIC(s), which are named after their driver type. Each entry will show the interface’s IP address(es), netmask, duplex and speed.
Very frequently, a server which otherwise boots up and appears fine but has a problematic or nonexistent network connection can be fixed with a check of its network interface configuration. Double-check the options set for your default ifconfig startup settings in the file /etc/rc.conf (at least in recent versions of FreeBSD). Frequently, a slow network connection is the result of a NIC configured for a different speed or duplex than its switch/router port, especially when “autosense” options are set but fail for whatever reason. This can frequently be remedied by resetting the connection with a simple ifconfig [interface] down followed by an ifconfig [interface] [options] up command.
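To see at a glance what speed and duplex each interface has settled on, you can pick the media lines out of ifconfig’s output. This sketch assumes the FreeBSD layout, where each interface block starts at the left margin and its details are indented beneath it; media_settings is a hypothetical helper name.

```shell
# Print "interface: media settings" for each interface in ifconfig output.
media_settings() {
    awk '/^[^ \t]/      { iface = $1; sub(/:$/, "", iface) }   # new interface block
         $1 == "media:" { sub(/^[ \t]*media:[ \t]*/, ""); print iface ": " $0 }'
}
```

Running ifconfig | media_settings gives one line per interface to compare against what the switch port is set to.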
• Weird errors with files or services may sometimes be caused by a full hard drive (preventing the system from writing logfiles or other operations). Use the df command to show your server’s mounted partitions and their available capacity.
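If you’d rather not eyeball the whole df listing, a short filter can flag only the partitions that are nearly full. This sketch assumes the usual six-column df layout with capacity in the fifth column and the mount point in the sixth; full_filesystems is a made-up name, and the default 90 percent threshold is my own assumption.

```shell
# List mount points whose capacity is at or above a threshold (default 90).
# Usage: df -k | full_filesystems [PERCENT]
full_filesystems() {
    threshold=${1:-90}
    awk -v t="$threshold" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)                    # "97%" -> "97"
        if (pct + 0 >= t + 0) print $6, $5   # mount point, then capacity
    }'
}
```

For example, df -k | full_filesystems 95 prints only the partitions that are at least 95 percent full.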
• A whole nasty horde of seemingly inexplicable problems are caused by simple issues with file permissions. In these cases, the humble ls command can be your best ally. Using ls -l will show you the permissions settings for files in any directory. Common issues include missing “x” (executable) permissions on CGI scripts or applications, or directory permissions which don’t allow “r” (reading) or “x” (entering).
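The same check can be scripted by reading the mode string that ls prints. The sketch below tests whether “other” users have both read and execute permission, which is what matters if your web server runs as a user that is neither the file’s owner nor in its group (an assumption that may not match your setup). The name world_can_run is hypothetical.

```shell
# Succeed when "other" users have both r and x on a file or directory.
# The mode string looks like -rwxr-xr-x: characters 8 and 10 are the
# "other" read and execute bits.
world_can_run() {
    mode=$(ls -ld "$1" 2>/dev/null | awk '{ print $1 }')
    case $mode in
        ???????r?x*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

A call such as world_can_run [path_to_script] || echo "check permissions" can then be run against each CGI script and each directory above it.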
• When bizarre things are happening, the system logfiles are the first place to check. Under BSD, you’ll find these in /var/log; the first place to look is /var/log/messages, where syslog deposits all the messages that aren’t specified to go into another logfile. In fact, the entire /var/log directory is home to the messages for different services – from telnet/SSH or FTP logins to SMTP and POP connections to system errors and kernel messages.
Checking these files can often provide the answers to 90 percent of “I can’t do X” messages from desperate system users. Check /etc/syslog.conf to see where the syslog daemon is sending the errors it receives; check the config files for individual applications or services to see which logfiles they’re writing to.
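For a quick map of which messages go to which file, you can strip syslog.conf down to its selector and logfile pairs. This is a rough sketch that only reports lines whose action is a file path; entries that send messages to users, remote hosts or programs are deliberately ignored, and syslog_targets is a name I’ve made up.

```shell
# List "selector logfile" pairs from a syslog.conf-style file on stdin.
# Comment lines, single-field lines and non-file actions are skipped.
syslog_targets() {
    awk '$1 !~ /^#/ && NF >= 2 && $2 ~ /^-?\// { print $1, $2 }'
}
```

Running syslog_targets < /etc/syslog.conf shows where each class of message ends up.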
• If the webserver won’t start, but there aren’t any clues elsewhere, immediately look at the webserver logfiles. Using Apache, these are generally located in the file [apache_dir]/logs/error_log or something similar. Even if apachectl runs and fails while printing a simple message like “httpd: could not be started” (this message is the winner of the “FrontPage Memorial ’Duh’ Award for Unhelpful Error Handling” three years in a row), the problem will almost certainly be logged to Apache’s errors file.
For problems with specific virtual hosts on a server, check wherever their logfiles are located. This is generally specified inside that domain’s <VirtualHost> … </VirtualHost> section in the httpd.conf file. If no error logfile is specified for that virtual host, then errors will be logged to the main Apache error file.
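On a server with many virtual hosts, hunting through httpd.conf by hand gets old quickly. The sketch below lists every ErrorLog setting alongside the virtual host it belongs to, labelling anything outside a virtual host section as “main”. It is a simple line-by-line reading of the file and not a real Apache config parser, and error_logs is a hypothetical name.

```shell
# Print "scope: logfile" for each ErrorLog directive in an httpd.conf-style
# file read from stdin. Directive names are matched case-insensitively.
error_logs() {
    awk '
        BEGIN { scope = "main" }
        tolower($1) == "<virtualhost"   { scope = $2; sub(/>$/, "", scope) }
        tolower($1) == "</virtualhost>" { scope = "main" }
        tolower($1) == "errorlog"       { print scope ": " $2 }
    '
}
```

Feeding it the configuration with error_logs < [apache_dir]/conf/httpd.conf shows at a glance which virtual hosts have their own error log and which fall back to the main one.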
Of course, all of the above are merely a few recommendations derived from my experience; if you have found other “First Aid Tools” or “Usual Suspects” that you rely on for server administration, please let me know at email@example.com and I’ll include them in an upcoming column.