Monitoring and Alerting Setup
Category: DevOps and System Tools
Type: Linux Commands
Generated on: 2025-07-10 03:23:18
For: System Administration, Development & Technical Interviews
Monitoring and Alerting Setup (Linux Commands - DevOps and System Tools) - Cheat Sheet
This cheat sheet provides a comprehensive overview of Linux commands and tools used for monitoring and alerting, focusing on DevOps and system administration tasks.
1. Command Overview
This section covers commands for monitoring system resources, logs, and network traffic, and setting up alerts based on predefined thresholds.
- top/htop: Real-time process monitoring and system resource usage. htop is a more user-friendly, interactive version of top.
- vmstat: Virtual memory statistics. Reports information about processes, memory, paging, block I/O, traps, and CPU activity.
- iostat: Input/output statistics for devices. Reports disk I/O activity.
- df: Disk space usage. Reports file system disk space usage.
- du: Disk usage per directory. Estimates file space usage.
- free: Memory usage. Displays the total amount of free and used physical and swap memory in the system.
- netstat/ss: Network statistics. ss is the modern replacement for netstat.
- tcpdump: Network packet analyzer. Captures and analyzes network traffic.
- ping: Tests network connectivity. Sends ICMP echo requests to a host.
- traceroute: Traces the route packets take to a host.
- uptime: System uptime and load average.
- sar: System activity reporter. Collects, reports, and saves system activity information.
- journalctl: View and manage systemd journal logs.
- tail: Displays the last part of a file. Used for monitoring log files.
- grep: Searches for patterns in files. Used for filtering log files.
- awk: Powerful text processing tool, useful for parsing log files and extracting data.
- sed: Stream editor for transforming text.
- watch: Executes a command periodically and displays the output.
- sensors: Monitors hardware sensors, such as temperature and voltage.
- uptime-kuma: Self-hosted monitoring tool with a web UI (requires installation).
- Prometheus: Time-series database and monitoring system (requires installation and configuration).
- Grafana: Data visualization and dashboarding tool (requires installation and configuration).
- Alertmanager: Handles alerts sent by Prometheus (requires installation and configuration).
2. Basic Syntax
This section outlines the basic syntax for each command.
- top:
    top [options]
- htop:
    htop [options]
- vmstat:
    vmstat [options] [delay [count]]
- iostat:
    iostat [options] [device...] [interval] [count]
- df:
    df [options] [file...]
- du:
    du [options] [file...]
- free:
    free [options]
- netstat/ss:
    netstat [options]
    ss [options]
- tcpdump:
    tcpdump [options] [expression]
- ping:
    ping [options] host
- traceroute:
    traceroute [options] host
- uptime:
    uptime
- sar:
    sar [options] [interval] [count]
- journalctl:
    journalctl [options]
- tail:
    tail [options] file
- grep:
    grep [options] pattern [file...]
- awk:
    awk 'pattern { action }' file
- sed:
    sed 's/pattern/replacement/g' file
- watch:
    watch [options] command
- sensors:
    sensors
3. Practical Examples
This section provides practical examples of using these commands. In each example the first line is the command and the following lines are sample output.
- top: Monitor CPU and memory usage.
    top
    top - 14:32:15 up 1 day, 2:15, 1 user, load average: 0.01, 0.05, 0.08
    Tasks: 154 total, 1 running, 153 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem : 1999948 total, 139048 free, 1356200 used, 504700 buff/cache
    KiB Swap: 2097148 total, 2097148 free, 0 used. 507156 avail Mem
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
        1 root      20   0  167308   5544   3860 S   0.0  0.3   0:06.25 systemd
        2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
        3 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_gp
        4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_par_gp
        6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H-kblockd
        8 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq
        9 root      20   0       0      0      0 S   0.0  0.0   0:00.00 ksoftirqd/0
       10 root      20   0       0      0      0 I   0.0  0.0   0:00.00 rcu_sched
- htop: Interactive process monitoring.
    htop
  (Requires installation: sudo apt install htop or sudo yum install htop)
- vmstat 1 5: Show virtual memory stats every 1 second, 5 times.
    vmstat 1 5
    procs -----------memory---------- ---swap-- -----io---- -system-- --------cpu--------
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy  id wa st
     0  0      0 140364   9860 506008    0    0     0     0   11   12  0  0  99  0  0
     0  0      0 140364   9860 506008    0    0     0     0   10   12  0  0 100  0  0
     0  0      0 140364   9860 506008    0    0     0     0   10   12  0  0 100  0  0
     0  0      0 140364   9860 506008    0    0     0     0   10   12  0  0 100  0  0
     0  0      0 140364   9860 506008    0    0     0     0   10   12  0  0 100  0  0
- iostat -x 1 5: Show extended I/O statistics every 1 second, 5 times.
    iostat -x 1 5
    Linux 5.15.0-101-generic (hostname) 11/03/2024 _x86_64_ (1 CPU)
    avg-cpu: %user %nice %system %iowait %steal %idle
    0.10 0.00 0.10 0.00 0.00 99.80
    Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
    sda 0.00 0.10 0.00 0.80 0.00 0.00 0.00 0.00 0.00 8.00 0.00 0.00 8.00 1.00 0.01
    avg-cpu: %user %nice %system %iowait %steal %idle
    0.00 0.00 0.00 0.00 0.00 100.00
    Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
    sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
- df -h: Display disk space usage in human-readable format.
    df -h
    Filesystem      Size  Used Avail Use% Mounted on
    udev            959M     0  959M   0% /dev
    tmpfs           197M  1.1M  196M   1% /run
    /dev/sda1        20G  7.9G   11G  43% /
    tmpfs           984M     0  984M   0% /dev/shm
    tmpfs           5.0M     0  5.0M   0% /run/lock
    /dev/sdb1       100G   60G   40G  60% /data
    tmpfs           197M  4.0K  197M   1% /run/user/1000
- du -sh /var/log: Show the size of the /var/log directory in human-readable format.
    du -sh /var/log
    32M /var/log
- free -m: Display memory usage in megabytes.
    free -m
                  total        used        free      shared  buff/cache   available
    Mem:           1953        1323         136          78         493         504
    Swap:          2047           0        2047
- ss -ltnp: Show listening TCP ports with process names.
    ss -ltnp
    State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
    LISTEN  0       4096    0.0.0.0:22          0.0.0.0:*          users:(("sshd",pid=1133,fd=3))
    LISTEN  0       4096    [::]:22             [::]:*             users:(("sshd",pid=1133,fd=4))
- tcpdump -i eth0 -n port 80: Capture HTTP traffic on interface eth0 (capturing usually requires root).
    tcpdump -i eth0 -n port 80
  (This will output a stream of captured packets. Stop with Ctrl+C)
- ping -c 4 google.com: Ping google.com 4 times.
    ping -c 4 google.com
    PING google.com (142.250.184.142) 56(84) bytes of data.
    64 bytes from fra16s36-in-f14.1e100.net (142.250.184.142): icmp_seq=1 ttl=117 time=6.41 ms
    64 bytes from fra16s36-in-f14.1e100.net (142.250.184.142): icmp_seq=2 ttl=117 time=6.50 ms
    64 bytes from fra16s36-in-f14.1e100.net (142.250.184.142): icmp_seq=3 ttl=117 time=6.74 ms
    64 bytes from fra16s36-in-f14.1e100.net (142.250.184.142): icmp_seq=4 ttl=117 time=6.65 ms
    --- google.com ping statistics ---
    4 packets transmitted, 4 received, 0% packet loss, time 3004ms
    rtt min/avg/max/mdev = 6.412/6.578/6.742/0.122 ms
- traceroute google.com: Trace the route to google.com.
    traceroute google.com
  (This will output a series of hops to the destination)
- uptime: Display system uptime and load average.
    uptime
    14:40:02 up 1 day, 2:22, 1 user, load average: 0.00, 0.01, 0.05
- sar -u 1 5: Report CPU utilization every 1 second, 5 times.
    sar -u 1 5
    Linux 5.15.0-101-generic (hostname) 11/03/2024 _x86_64_ (1 CPU)
    14:40:50        CPU     %user     %nice   %system   %iowait    %steal     %idle
    14:40:51        all      0.00      0.00      0.00      0.00      0.00    100.00
    14:40:52        all      0.00      0.00      0.00      0.00      0.00    100.00
    14:40:53        all      0.00      0.00      0.00      0.00      0.00    100.00
    14:40:54        all      0.00      0.00      0.00      0.00      0.00    100.00
    14:40:55        all      0.00      0.00      0.00      0.00      0.00    100.00
    Average:        all      0.00      0.00      0.00      0.00      0.00    100.00
- journalctl -xe: View the most recent systemd journal entries. -x adds explanatory text from the message catalog where available, and -e jumps to the end of the journal in the pager.
    journalctl -xe
  (This can display a large amount of log data. Use the arrow keys to navigate.)
- tail -f /var/log/syslog: Follow the syslog file and display new entries in real time.
    tail -f /var/log/syslog
  (This will continuously display new log entries. Stop with Ctrl+C)
- grep "error" /var/log/syslog: Search for "error" in the syslog file.
    grep "error" /var/log/syslog
  (This will output lines containing the string "error")
- awk '/error/ {print $0}' /var/log/syslog: Use awk to print lines containing "error" from syslog.
    awk '/error/ {print $0}' /var/log/syslog
- sed 's/error/WARNING/g' /var/log/syslog: Replace all occurrences of "error" with "WARNING" in syslog. The result goes to stdout and the file is not modified. To modify the file in place, use sed -i 's/error/WARNING/g' /var/log/syslog.
    sed 's/error/WARNING/g' /var/log/syslog
- watch -n 1 "free -m": Run free -m every 1 second and display the output.
    watch -n 1 "free -m"
- sensors: Display hardware sensor information (requires the lm-sensors package).
    sensors
  (Requires installation: sudo apt install lm-sensors on Debian/Ubuntu, or sudo yum install lm_sensors on RHEL-based systems. You may need to run sudo sensors-detect after installation.)
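The free -m output shown in this section can be turned into a simple threshold check. The following is a minimal sketch, not a definitive implementation: the function names mem_used_pct and mem_alert are hypothetical, and it assumes the procps-ng column layout from the example, where "available" is the seventh field of the Mem: row.

```shell
#!/bin/sh
# mem_used_pct: read `free -m` output on stdin and print the share of memory
# that is not available, computed as (total - available) * 100 / total.
# Assumption: "available" is field 7 of the "Mem:" row (procps-ng layout).
mem_used_pct() {
    awk '/^Mem:/ { printf "%d\n", ($2 - $7) * 100 / $2 }'
}

# mem_alert LIMIT: hypothetical helper that reads `free -m` output on stdin
# and prints a warning when the computed percentage is above LIMIT.
mem_alert() {
    limit=$1
    pct=$(mem_used_pct)
    if [ "$pct" -gt "$limit" ]; then
        echo "WARNING: memory use ${pct}% exceeds ${limit}%"
    else
        echo "OK: memory use ${pct}%"
    fi
}

# Usage on a live system:
#   free -m | mem_alert 90
```

Reading from stdin rather than calling free inside the function keeps the parsing step easy to try against saved output.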
4. Common Options
This section lists common options for each command.
- top:
    -d <seconds>: Delay between updates.
    -u <user>: Show only processes owned by a specific user.
    -p <pid>: Monitor only the given PID.
    Shift+M (interactive key): Sort by memory usage.
    Shift+P (interactive key): Sort by CPU usage.
- htop (interactive keys):
    F1: Help.
    F2: Setup.
    F3: Search.
    F6: Sort.
    k: Kill process.
- vmstat:
    <delay>: Delay between updates in seconds.
    <count>: Number of updates.
    -s: Display event counters and memory statistics.
- iostat:
    -x: Extended statistics.
    -d: Display only device statistics.
    -p [device]: Display statistics for block devices and their partitions.
    <interval>: Update interval in seconds.
    <count>: Number of updates.
- df:
    -h: Human-readable format.
    -T: Show file system type.
    -i: Show inode information.
    -a: Include pseudo, duplicate, and inaccessible file systems.
- du:
    -h: Human-readable format.
    -s: Summarize disk usage.
    -c: Grand total.
    -d <depth>: Limit directory depth.
- free:
    -m: Megabytes.
    -g: Gigabytes.
    -h: Human-readable.
    -s <seconds>: Update interval.
    -c <count>: Number of updates.
- netstat/ss:
    -l: Listening sockets.
    -t: TCP sockets.
    -u: UDP sockets.
    -n: Numeric addresses (don't resolve hostnames).
    -p: Show process name and PID.
    -a: All sockets.
    -i (netstat only): Show network interfaces table. In ss, -i shows internal TCP information instead.
    -r (netstat only): Show routing table. In ss, -r resolves host names instead.
- tcpdump:
    -i <interface>: Specify the interface to listen on.
    -n: Numeric addresses (don't resolve hostnames).
    -nn: Don't resolve hostnames or port names.
    -v: Verbose output.
    -vv: More verbose output.
    -w <file>: Write packets to a file.
    -r <file>: Read packets from a file.
    -c <count>: Capture only <count> packets.
- ping:
    -c <count>: Number of pings.
    -i <interval>: Interval between pings in seconds.
    -s <size>: Packet payload size in bytes.
    -t <ttl>: Time to live.
- traceroute:
    -m <max_hops>: Maximum hops.
    -n: Numeric addresses (don't resolve hostnames).
- sar:
    -u: CPU utilization.
    -r: Memory utilization.
    -d: Block device (disk) activity.
    -n DEV: Network device statistics.
    -P ALL: Per-processor statistics.
    -f <file>: Read data from a file.
- journalctl:
    -x: Add explanatory text from the message catalog.
    -e: Jump to the end of the journal in the pager (commonly combined as -xe).
    -f: Follow the log.
    -u <unit>: Show logs for a specific unit (e.g., nginx.service).
    --since <date>: Show logs since a specific date/time.
    --until <date>: Show logs until a specific date/time.
    -k: Show kernel messages.
    -b: Show logs from the current boot.
    -n <lines>: Show the last <lines> lines of the log.
- tail:
    -f: Follow the file.
    -n <lines>: Show the last <lines> lines of the file.
    -n +<lines>: Begin output at line number <lines>.
- grep:
    -i: Ignore case.
    -v: Invert match (show lines that don't match).
    -r: Recursive search.
    -n: Show line numbers.
    -c: Count the number of matching lines.
    -l: List file names containing matches.
    -w: Match whole words only.
    -A <num>: Print <num> lines after the matching line.
    -B <num>: Print <num> lines before the matching line.
    -C <num>: Print <num> lines around the matching line.
- awk:
    -F <delimiter>: Specify the field delimiter.
    -v var=value: Assign a value to a variable.
    -f <file>: Read awk commands from a file.
- sed:
    -i: Edit the file in place. WARNING: Use with caution, as this modifies the original file.
    -n: Suppress automatic printing of pattern space.
    s/pattern/replacement/g: Substitute pattern with replacement globally.
    /pattern/d: Delete lines matching the pattern.
- watch:
    -n <seconds>: Interval between updates.
    -d: Highlight the differences between successive updates.
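Several of the grep and tail options listed above combine naturally when checking a log file. The sketch below is illustrative only: count_word and last_matches are hypothetical helper names, and the log path in the usage comment is just an example.

```shell
#!/bin/sh
# count_word WORD FILE: number of lines in FILE that contain WORD as a whole
# word, ignoring case (-c count, -i ignore case, -w whole word).
count_word() {
    grep -ciw -- "$1" "$2"
}

# last_matches WORD FILE N: the last N lines of FILE that contain WORD
# (case-insensitive substring match), each prefixed with its line number (-n).
last_matches() {
    grep -in -- "$1" "$2" | tail -n "$3"
}

# Usage:
#   count_word error /var/log/syslog
#   last_matches error /var/log/syslog 5
```

Note that -w excludes words such as "errors", while the plain substring match in last_matches includes them.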
5. Advanced Usage
This section covers more complex examples and combinations of commands.
- Combining tail, grep, and awk for real-time log analysis:
    tail -f /var/log/nginx/error.log | grep --line-buffered "error" | awk '{print $1, $3, $7}'
  This command tails the Nginx error log, filters lines containing "error", and prints the first, third, and seventh whitespace-separated fields using awk. With the default Nginx error log format these are the date, the log level, and one word of the message, so adjust the field numbers to your log format. The --line-buffered option makes grep pass each match on immediately instead of buffering output while it writes to a pipe.
- Monitoring disk space usage and sending an email alert:
    #!/bin/bash
    THRESHOLD=90  # Disk usage threshold in percent
    # -P keeps each file system on one line, so the awk field positions are stable
    USAGE=$(df -P / | awk 'NR==2 {print $5}' | tr -d '%')
    if [ "$USAGE" -gt "$THRESHOLD" ]; then
        echo "Disk space on / is above $THRESHOLD%: $USAGE%" | mail -s "Disk Space Alert" admin@example.com
    fi
  This script checks disk usage on / and sends an email alert if it exceeds 90%. Save this script (e.g., disk_check.sh), make it executable (chmod +x disk_check.sh), and schedule it with cron (e.g., every 5 minutes with the schedule */5 * * * *).
- Using sar to identify performance bottlenecks:
    sar -u -d 1 10  # CPU and disk utilization every 1 second, 10 samples
  Analyze the output to identify high CPU usage, disk I/O bottlenecks, or other performance issues.
- Creating a custom monitoring dashboard with watch:
    watch -n 5 'echo "CPU Usage:"; sar -u 1 1 | tail -1; echo "Memory Usage:"; free -m | grep "^Mem"; echo "Disk I/O:"; iostat -dx | tail -n +3'
  This command creates a simple dashboard that displays CPU usage, memory usage, and disk I/O statistics every 5 seconds. grep "^Mem" selects the memory row of free (the last row is swap), and tail -n +3 skips the iostat banner so the device table is shown. This is a very basic example, and more sophisticated dashboards can be created using tools like Grafana.
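A one-line watch command becomes hard to read as sections are added, so the same dashboard idea can be kept in a small script. This is a sketch under assumptions: the helper names section and dashboard, the choice of sections, and the file name dashboard.sh are all hypothetical.

```shell
#!/bin/sh
# section TITLE CMD [ARGS...]: print a header, then run CMD if it is on PATH;
# otherwise print a short note instead of failing.
section() {
    title=$1
    shift
    printf '== %s ==\n' "$title"
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        printf '(%s not installed)\n' "$1"
    fi
}

# dashboard: one snapshot of load, memory, and root file system usage.
dashboard() {
    section "Load" uptime
    section "Memory (MB)" free -m
    section "Disk" df -h /
}

# To refresh every 5 seconds, save this file as dashboard.sh (hypothetical
# name), add a final line that calls `dashboard`, and run:
#   watch -n 5 ./dashboard.sh
```

watch runs its argument through a separate shell, so functions defined in an interactive session are not visible to it; that is why the sketch is meant to live in a script file.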
- Using journalctl to debug service startup issues:
    journalctl -u myapp.service -b  # Show logs for myapp.service from the current boot
  This command is useful for troubleshooting issues that occur during service startup.
- Monitoring network traffic with tcpdump and analyzing with tshark (Wireshark CLI):
    tcpdump -i eth0 -w capture.pcap  # Capture traffic to a file (usually requires root; stop with Ctrl+C)
    # After capturing traffic, extract specific fields:
    tshark -r capture.pcap -T fields -e ip.src -e ip.dst -e tcp.port -e frame.len
  This captures network traffic and then uses tshark to extract relevant data like source and destination IPs, ports, and frame length. This is helpful for network troubleshooting and security analysis.
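The disk-space alert from this section can also be split into a parsing step and a decision step, which makes the threshold logic easy to try against saved df output. This is a minimal sketch, assuming POSIX df -P output; the function names usage_pct and check_disk are hypothetical, and sending the alert (for example by piping to mail) is left to the caller.

```shell
#!/bin/sh
# usage_pct: read `df -P <mount>` output on stdin and print the use
# percentage of the first listed file system as a bare integer.
usage_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_disk MOUNT THRESHOLD: print an alert line and return 1 when usage on
# MOUNT is above THRESHOLD percent; print nothing and return 0 otherwise.
check_disk() {
    mount=$1
    threshold=$2
    usage=$(df -P "$mount" | usage_pct)
    if [ "$usage" -gt "$threshold" ]; then
        echo "Disk usage on $mount is ${usage}% (threshold ${threshold}%)"
        return 1
    fi
    return 0
}

# Usage, e.g. from a cron job:
#   msg=$(check_disk / 90) || echo "$msg" | mail -s "Disk Space Alert" admin@example.com
```

Returning a non-zero status only when the threshold is exceeded lets the caller decide how to notify, so the same check can feed email, a log line, or another alerting tool.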
6. Tips & Tricks
- Use aliases for frequently used commands: Add aliases to your ~/.bashrc or ~/.zshrc file. For example:
    alias dfh='df -h'
    alias topm='top -o %MEM'  # Sort by memory
    alias tailf='tail -f'
- Combine commands with pipes for powerful filtering and analysis: As demonstrated in the examples above, piping commands together allows for complex data processing.
- Use nohup to run monitoring commands in the background:
    nohup tail -f /var/log/myapp.log > myapp.log.out 2>&1 &
  This command runs tail -f in the background, redirecting output to myapp.log.out. The 2>&1 redirects standard error to standard output.
- Learn regular expressions for more powerful grep, awk, and sed usage: Regular expressions are essential for advanced text processing.
- Consider using specialized monitoring tools like Prometheus and Grafana for production environments: These tools provide more advanced features like data visualization, alerting, and historical data analysis. They require setup and configuration but are well worth the effort for complex environments.
- Use screen or tmux for persistent terminal sessions: This allows you to keep monitoring commands running even if your SSH connection is interrupted.
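As an illustration of the piping and regular-expression tips, the sketch below counts log entries per severity level. It assumes the default Nginx error log layout, in which the third whitespace-separated field is the level in square brackets (for example [error]); the function name level_counts is hypothetical.

```shell
#!/bin/sh
# level_counts: read Nginx-style error log lines on stdin and print
# "<count> <level>" pairs, highest count first.
# Assumption: field 3 looks like "[error]", "[warn]", and so on.
level_counts() {
    awk '$3 ~ /^\[[a-z]+\]$/ {
             sub(/^\[/, "", $3)    # strip the opening bracket
             sub(/\]$/, "", $3)    # strip the closing bracket
             n[$3]++
         }
         END { for (l in n) print n[l], l }' | sort -rn
}

# Usage:
#   level_counts < /var/log/nginx/error.log
```

The regular expression guards against lines in other formats, which are skipped rather than miscounted.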
7. Troubleshooting
- top or htop not showing all processes: Ensure you have sufficient permissions to view all processes. Try running with sudo.
- df or du showing unexpected disk usage: Ensure the file system is mounted correctly and check for hidden files or directories consuming space. Files that were deleted but are still held open by a process count toward df but not du; lsof +L1 lists them.
- netstat not working: netstat is deprecated and often not installed by default (it ships in the net-tools package). Use ss instead.
- tcpdump capturing no traffic: Double-check the interface name and filter expression. Ensure you have permissions to capture traffic.
- journalctl showing no logs: Ensure systemd-journald is running and configured correctly, and that your user is allowed to read the journal (root or a member of the systemd-journal group). Check the /etc/systemd/journald.conf file.
- sensors not working: Make sure you have the lm-sensors package installed and have run sudo sensors-detect.
- Email alerts not being sent: Verify that your system is configured to send email and that the recipient address is valid. Check your mail server logs for errors.
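Several of the problems above come down to a tool that is not installed. A quick preflight check can rule that out before digging deeper. This is a small sketch; the function name missing_tools and the tool list in the usage comment are only examples.

```shell
#!/bin/sh
# missing_tools TOOL...: print the name of each TOOL that is not found on
# PATH, one per line. Prints nothing when every tool is available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage:
#   missing_tools ss tcpdump sensors journalctl mail
```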
8. Related Commands
- ps: Process status. Provides a snapshot of current processes.
- kill: Sends a signal to a process. Used to terminate or control processes.
- killall: Kills processes by name.
- systemctl: Controls systemd services.
- crontab: Manages cron jobs (scheduled tasks).
- lsof: Lists open files. Useful for identifying which processes are using specific files or network ports.
- strace: Traces system calls made by a process. Useful for debugging.
- dmesg: Displays kernel messages.
- iptables/nftables: Firewall configuration.
This cheat sheet provides a solid foundation for monitoring and alerting on Linux systems. Remember to adapt these commands and techniques to your specific needs and environment. Always test thoroughly before implementing changes in a production environment.