To make troubleshooting as easy as possible, always follow an organized methodology. A few simple best practices will help you do that.
The most important troubleshooting best practice is to document all of your operations. This will prove helpful in a critical situation, as your notes will tell you about service dependencies, permission issues, and so on. Start with quick fixes: if a problem sounds familiar, try a couple of known remedies first, as this often resolves the issue. Do not act randomly: work through possible causes in a fixed order, for example hardware first, then software, then recent changes, then logs, then asking the user about the nature of the problem (sometimes the problem is the user). If all symptoms point at a certain service or process, you can kill and restart it.
You are expected to be able to inspect system log files and determine the cause of errors using commands and operators such as locate, find, grep, ?, <, >, >>, cat, and tail.
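As a minimal sketch of how these commands combine, the function below filters a log file for error lines and prints the most recent matches. The name `recent_errors`, the search pattern, and the paths in the usage comments are illustrative assumptions; log locations vary by distribution.

```shell
# recent_errors FILE [COUNT]: print the last COUNT lines (default 10)
# of FILE that contain "error" or "fail", ignoring case.
# The function name and the pattern are hypothetical; adapt them to your logs.
recent_errors() {
    file=$1
    count=${2:-10}
    # grep keeps the matching lines; the pipe hands them to tail.
    grep -i -E 'error|fail' "$file" | tail -n "$count"
}

# Example usage (paths are assumptions):
#   recent_errors /var/log/messages 20
#   recent_errors /var/log/messages > /tmp/errors.txt    # > overwrites the file
#   recent_errors /var/log/messages >> /tmp/errors.txt   # >> appends to it
#   find /var/log -name '*.log' -mtime -1                # logs modified in the last day
```

Saving the filtered output with `>` or `>>` also gives you a record to attach to your troubleshooting notes.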
A lot of error messages in Linux come from mismatched software versions and the dependencies associated with them. If you change or update a PHP package, for example, a PHP-based program might stop working. Use the rpm command to view a package's dependencies, verify them before making any change, and document every change you make.
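One way to check dependencies before touching a package is sketched below. `show_deps` is a hypothetical helper, not part of rpm; it assumes an RPM-based system and that the argument names an installed package.

```shell
# show_deps PACKAGE: list what PACKAGE requires and what requires it.
# show_deps is a hypothetical helper; it assumes rpm is installed and
# that PACKAGE is the name of an installed package.
show_deps() {
    pkg=$1
    echo "== $pkg requires =="
    rpm -q --requires "$pkg"
    echo "== required by =="
    rpm -q --whatrequires "$pkg"
}

# Example usage (the package name is an assumption):
#   show_deps php > php-deps-before-upgrade.txt
```

Redirecting the output to a file before an upgrade gives you a baseline to compare against if something stops working afterwards.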
Troubleshooting the file system
To verify and repair a file system, you can use the mount command to list the mounted partitions on the system and the fsck command to check and repair them. Run fsck only on a file system that is unmounted or mounted read-only.
You can use the df command to see the space used on each file system. Problems can occur when a disk is full.
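To spot a nearly full disk quickly, the output of df can be filtered. The sketch below assumes the POSIX output format produced by `df -P` and mount points without spaces; `full_filesystems` and the default threshold of 90 percent are illustrative choices.

```shell
# full_filesystems [THRESHOLD]: read `df -P` output on standard input and
# print the mount point and usage of every file system at or above
# THRESHOLD percent (default 90). Hypothetical helper; assumes mount
# points contain no spaces.
full_filesystems() {
    threshold=${1:-90}
    awk -v limit="$threshold" '
        NR > 1 {                 # skip the header line
            use = $5
            sub(/%/, "", use)    # "87%" -> "87"
            if (use + 0 >= limit + 0) print $6, $5
        }'
}

# Example usage:
#   df -P | full_filesystems 80
```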
Troubleshooting the boot process
Even with the most robust file systems, failures will happen. You may encounter situations where the system boots into single-user mode. This operating mode does not start all daemons and is useful for troubleshooting. In this mode you can use various troubleshooting tools, including checking file system integrity with fsck. If a system won’t boot, it is a good idea to boot from a floppy and inspect the file system and boot sector. A boot disk should always contain fsck, as it enables you to repair and rescue a damaged file system.
Troubleshooting backup and restore errors
Backups can fail for many reasons. The most common causes are media- and drive-related issues. Most media require proper maintenance and cleaning. Tape corruption, low device space, and write failures are common problems. Proprietary backup software has its own error messages, and you should consult your software provider to interpret them. Backups should always be handled with care. Run a restore test regularly, as it is not uncommon for a backup that completed successfully to fail on restore.
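A restore test can be as simple as extracting the archive into a scratch directory and comparing it with the source. The sketch below assumes a plain tar archive created with `tar -cf ARCHIVE -C SOURCE_DIR .`; `verify_backup` is a hypothetical helper, and proprietary backup tools will have their own verification procedures.

```shell
# verify_backup SOURCE_DIR ARCHIVE: restore ARCHIVE into a scratch
# directory and compare the result with SOURCE_DIR. Returns 0 only when
# the restored tree matches the source. Hypothetical helper; assumes the
# archive was created with: tar -cf ARCHIVE -C SOURCE_DIR .
verify_backup() {
    src=$1
    archive=$2
    scratch=$(mktemp -d) || return 1
    # Extract, then compare recursively; diff exits non-zero on any difference.
    tar -xf "$archive" -C "$scratch" && diff -r "$src" "$scratch" > /dev/null
    status=$?
    rm -rf "$scratch"
    return $status
}

# Example usage (paths are assumptions):
#   tar -cf /backup/etc.tar -C /etc .
#   verify_backup /etc /backup/etc.tar && echo "restore test passed"
```

Note that a mismatch can also mean the source changed after the backup was taken, so run the comparison soon after the backup completes.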
Linux, being based on one of the oldest network operating systems (UNIX), is loaded with standard troubleshooting tools. Some of these tools are:
- Ping: the ping utility enables you to verify basic connectivity between two machines.
- Route: the route utility lets you inspect the routes defined in the kernel's routing table. You can also add, delete, and modify routing information with it. This is very helpful when using your Linux box as a router or firewall.
- Traceroute: this utility enables you to see every router between your Linux machine and a given host. This way it is possible to see any failing point between you and this host.
- Netstat: this utility shows network interface statistics as well as active connections.
- Lsof: this utility lists open files, including network sockets, and the processes that hold them.
- Ifconfig: this utility lets you see your network interfaces and modify certain settings.
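The tools above are often used in sequence: find the default gateway, test it, then test the remote host. The sketch below shows one way to do that. Both function names are hypothetical, and it assumes the classic `route -n` output format in which the default route has destination 0.0.0.0.

```shell
# default_gateway: read `route -n` output on standard input and print the
# gateway of the default route (destination 0.0.0.0). Hypothetical helper.
default_gateway() {
    awk '$1 == "0.0.0.0" { print $2; exit }'
}

# check_path HOST: hypothetical helper that tests connectivity step by step.
check_path() {
    host=$1
    gw=$(route -n | default_gateway)
    echo "default gateway: ${gw:-none}"
    # ping -c 1 sends a single echo request.
    if [ -n "$gw" ] && ping -c 1 "$gw" > /dev/null; then
        echo "gateway reachable"
    fi
    if ping -c 1 "$host" > /dev/null; then
        echo "$host reachable"
    else
        echo "$host unreachable; try: traceroute $host"
    fi
}

# Example usage (the host name is an assumption):
#   check_path www.example.com
```

If the gateway answers but the remote host does not, traceroute is the natural next step for locating the failing hop.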
Also published on Medium.