Monitoring and Alerting Setup

Category: DevOps and System Tools
Type: Linux Commands
Generated on: 2025-07-10 03:23:18
For: System Administration, Development & Technical Interviews

Monitoring and Alerting Setup (Linux Commands - DevOps and System Tools) - Cheat Sheet

This cheat sheet provides a comprehensive overview of Linux commands and tools used for monitoring and alerting, focusing on DevOps and system administration tasks.

1. Command Overview

This section covers commands for monitoring system resources, logs, and network traffic, and setting up alerts based on predefined thresholds.

top / htop: Real-time process monitoring and system resource usage. htop is a more user-friendly, interactive version of top.
vmstat: Virtual memory statistics - reports information about processes, memory, paging, block IO, traps, and CPU activity.
iostat: Input/output statistics for devices. Reports disk I/O activity.
df: Reports disk space usage per file system.
du: Estimates the disk space used by files and directories.
free: Displays the amount of free and used physical and swap memory.
netstat / ss: Network statistics. ss is the modern replacement for netstat.
tcpdump: Network packet analyzer. Captures and analyzes network traffic.
ping: Tests network connectivity. Sends ICMP echo requests to a host.
traceroute: Traces the route packets take to a host.
uptime: System uptime and load average.
sar: System activity reporter. Collects, reports, and saves system activity information.
journalctl: View and manage systemd journal logs.
tail: Displays the last part of a file. Used for monitoring log files.
grep: Search for patterns in files. Used for filtering log files.
awk: Powerful text processing tool, useful for parsing log files and extracting data.
sed: Stream editor for transforming text.
watch: Executes a command periodically and displays the output.
sensors: Monitors hardware sensors, such as temperature and voltage.
uptime-kuma: Self-hosted monitoring tool with a web UI. (Requires installation)
Prometheus: Time-series database and monitoring system (Requires installation and configuration).
Grafana: Data visualization and dashboarding tool (Requires installation and configuration).
Alertmanager: Handles alerts sent by Prometheus (Requires installation and configuration).

2. Basic Syntax

This section outlines the basic syntax for each command.

top:
```
top [options]
```
htop:
```
htop [options]
```
vmstat:
```
vmstat [options] [delay [count]]
```

iostat:
```
iostat [options] [device...] [interval [count]]
```

df:
```
df [options] [file...]
```
du:
```
du [options] [file...]
```
free:
```
free [options]
```
netstat / ss:
```
netstat [options]
ss [options]
```
tcpdump:
```
tcpdump [options] [expression]
```
ping:
```
ping [options] host
```
traceroute:
```
traceroute [options] host
```
uptime:
```
uptime
```
sar:
```
sar [options] [interval [count]]
```
journalctl:
```
journalctl [options]
```
tail:
```
tail [options] file
```
grep:
```
grep [options] pattern [file...]
```
awk:
```
awk [options] 'pattern { action }' [file...]
```
sed:
```
sed [options] 's/pattern/replacement/g' [file...]
```
watch:
```
watch [options] command
```
sensors:
```
sensors
```

3. Practical Examples

This section provides practical examples of using these commands.

top: Monitor CPU and memory usage.

```
top
```
```
top - 14:32:15 up 1 day,  2:15,  1 user,  load average: 0.01, 0.05, 0.08
Tasks: 154 total,   1 running, 153 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.3 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1999948 total,   139048 free,  1356200 used,   504700 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.   507156 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20   0  167308   5544   3860 S   0.0  0.3   0:06.25 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    3 root       0  -20       0      0      0 I   0.0  0.0   0:00.00 rcu_gp
    4 root       0  -20       0      0      0 I   0.0  0.0   0:00.00 rcu_par_gp
    6 root       0  -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H-kblockd
    8 root       0  -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq
    9 root      20   0       0      0      0 S   0.0  0.0   0:00.00 ksoftirqd/0
   10 root      20   0       0      0      0 I   0.0  0.0   0:00.00 rcu_sched
```

htop: Interactive process monitoring.
```
htop
```
(Requires installation: sudo apt install htop or sudo yum install htop)

vmstat 1 5: Show virtual memory stats every 1 second, 5 times.

```
vmstat 1 5
```
```
procs -----------memory---------- ---swap-- -----io---- -system-- --------cpu--------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 140364   9860 506008    0    0     0     0   11   12  0  0 99  0  0
 0  0      0 140364   9860 506008    0    0     0     0   10   12  0  0 100  0  0
 0  0      0 140364   9860 506008    0    0     0     0   10   12  0  0 100  0  0
 0  0      0 140364   9860 506008    0    0     0     0   10   12  0  0 100  0  0
 0  0      0 140364   9860 506008    0    0     0     0   10   12  0  0 100  0  0
```

iostat -x 1 5: Show extended I/O statistics every 1 second, 5 times.

```
iostat -x 1 5
```
```
Linux 5.15.0-101-generic (hostname)  11/03/2024  _x86_64_        (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
       0.10    0.00    0.10    0.00    0.00   99.80

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await  aqu-sz rareq-sz wareq-sz  svctm  %util
sda              0.00    0.10      0.00      0.80     0.00     0.00   0.00   0.00    0.00    8.00    0.00     0.00     8.00   1.00   0.01

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
       0.00    0.00    0.00    0.00    0.00  100.00

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await  aqu-sz rareq-sz wareq-sz  svctm  %util
sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00    0.00     0.00     0.00   0.00   0.00
```

df -h: Display disk space usage in human-readable format.

```
df -h
```
```
Filesystem      Size  Used Avail Use% Mounted on
udev            959M     0  959M   0% /dev
tmpfs           197M  1.1M  196M   1% /run
/dev/sda1        20G  7.9G   11G  43% /
tmpfs           984M     0  984M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sdb1       100G   60G   40G  60% /data
tmpfs           197M  4.0K  197M   1% /run/user/1000
```

du -sh /var/log: Show the size of /var/log directory in human-readable format.
```
du -sh /var/log
```
```
32M  /var/log
```

free -m: Display memory usage in megabytes.

```
free -m
```
```
              total        used        free      shared  buff/cache   available
Mem:           1953        1323         136          78         493         504
Swap:          2047           0        2047
```
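
For threshold checks, the available column can be expressed as a percentage of total memory. The following is a minimal sketch, assuming the procps-ng layout shown above, where available is the seventh field of the Mem: row. The name mem_available_pct is a hypothetical helper, not a standard command.

```shell
# mem_available_pct (hypothetical helper): reads `free -m` output on stdin
# and prints available memory as a whole-number percentage of total.
# Assumes "available" is the 7th field of the "Mem:" row (procps-ng layout).
mem_available_pct() {
  awk '/^Mem:/ { printf "%d\n", $7 * 100 / $2 }'
}
```

Called as free -m | mem_available_pct, it prints a whole number (the fraction is truncated) that a script can compare against a limit of your choosing.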

ss -ltnp: Show listening TCP ports with process names.

```
ss -ltnp
```
```
State   Recv-Q  Send-Q   Local Address:Port     Peer Address:Port  Process
LISTEN  0       4096             0.0.0.0:22          0.0.0.0:*      users:(("sshd",pid=1133,fd=3))
LISTEN  0       4096                [::]:22             [::]:*      users:(("sshd",pid=1133,fd=4))
```

tcpdump -i eth0 -n port 80: Capture HTTP traffic on interface eth0.
```
tcpdump -i eth0 -n port 80
```
(Capturing usually requires root privileges, so prefix the command with sudo if needed. Packets stream to the terminal until you stop the capture with Ctrl+C.)

ping -c 4 google.com: Ping google.com 4 times.

```
ping -c 4 google.com
```
```
PING google.com (142.250.184.142) 56(84) bytes of data.
64 bytes from fra16s36-in-f14.1e100.net (142.250.184.142): icmp_seq=1 ttl=117 time=6.41 ms
64 bytes from fra16s36-in-f14.1e100.net (142.250.184.142): icmp_seq=2 ttl=117 time=6.50 ms
64 bytes from fra16s36-in-f14.1e100.net (142.250.184.142): icmp_seq=3 ttl=117 time=6.74 ms
64 bytes from fra16s36-in-f14.1e100.net (142.250.184.142): icmp_seq=4 ttl=117 time=6.65 ms

--- google.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 6.412/6.578/6.742/0.122 ms
```
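
For alerting, the packet loss figure in the summary line is usually the part of interest. The following sketch extracts it, assuming the iputils summary format shown above. The name packet_loss_pct is a hypothetical helper.

```shell
# packet_loss_pct (hypothetical helper): reads ping output on stdin and
# prints the packet loss percentage taken from the summary line.
# Assumes a summary of the form "... N% packet loss ..." (iputils ping).
packet_loss_pct() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

A check could then run ping -c 4 host | packet_loss_pct and raise an alert when the printed value is not 0.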

traceroute google.com: Trace the route to google.com.
```
traceroute google.com
```
(This will output a series of hops to the destination)

uptime: Display system uptime and load average.

```
uptime
```
```
 14:40:02 up 1 day,  2:22,  1 user,  load average: 0.00, 0.01, 0.05
```
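
The same load averages are available in /proc/loadavg, which is easier to parse in a script than uptime output. The following is a minimal sketch of a threshold check on the 1-minute load average. The name load_above and its optional file argument are hypothetical, and a sensible threshold depends on the number of CPUs.

```shell
# load_above (hypothetical helper): succeeds (exit status 0) when the
# 1-minute load average is greater than the threshold given as $1.
# The optional second argument names the file to read; it defaults to
# /proc/loadavg and exists mainly so the check can be tried on sample data.
load_above() {
  awk -v limit="$1" '{ exit !($1 > limit) }' "${2:-/proc/loadavg}"
}
```

For example, load_above 4 && echo "High load" prints the message only when the 1-minute load average exceeds 4.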

sar -u 1 5: Report CPU utilization every 1 second, 5 times.

```
sar -u 1 5
```
```
Linux 5.15.0-101-generic (hostname)  11/03/2024  _x86_64_        (1 CPU)

14:40:50     CPU     %user     %nice   %system   %iowait    %steal     %idle
14:40:51     all      0.00      0.00      0.00      0.00      0.00    100.00
14:40:52     all      0.00      0.00      0.00      0.00      0.00    100.00
14:40:53     all      0.00      0.00      0.00      0.00      0.00    100.00
14:40:54     all      0.00      0.00      0.00      0.00      0.00    100.00
14:40:55     all      0.00      0.00      0.00      0.00      0.00    100.00
Average:     all      0.00      0.00      0.00      0.00      0.00    100.00
```

journalctl -xe: Jump to the end of the systemd journal (-e) and add explanatory text from the message catalog where available (-x).
```
journalctl -xe
```
(This will display a large amount of log data. Use arrow keys to navigate.)
tail -f /var/log/syslog: Follow the syslog file and display new entries in real-time.
```
tail -f /var/log/syslog
```
(This will continuously display new log entries. Stop with Ctrl+C)
grep "error" /var/log/syslog: Search for “error” in the syslog file.
```
grep "error" /var/log/syslog
```
(This will output every line containing the string “error”, including as part of a longer word. Add -i to ignore case or -w to match whole words only.)
awk '/error/ {print $0}' /var/log/syslog: Use awk to print lines containing “error” from syslog.
```
awk '/error/ {print $0}' /var/log/syslog
```
sed 's/error/WARNING/g' /var/log/syslog: Replace every occurrence of “error” with “WARNING” in the output. The result goes to stdout and the file itself is not modified. Adding -i would edit the file in place, which is rarely appropriate for a live system log.
```
sed 's/error/WARNING/g' /var/log/syslog
```
watch -n 1 "free -m": Run free -m every 1 second and display the output.
```
watch -n 1 "free -m"
```
sensors: Display hardware sensor information (requires lm-sensors package).
```
sensors
```
(Requires installation: sudo apt install lm-sensors or sudo yum install lm-sensors. You may need to run sudo sensors-detect after installation.)

4. Common Options

This section lists common options for each command.

top:
- -d <seconds>: Delay between updates.
- -u <user>: Show processes for a specific user.
- -p <pid>: Monitor only the process with the given PID.
- Shift+M (interactive key): Sort by memory usage.
- Shift+P (interactive key): Sort by CPU usage.
htop (interactive keys):
- F1: Help.
- F2: Setup.
- F3: Search.
- F6: Sort.
- k: Kill process.
vmstat:
- <delay>: Delay between updates in seconds.
- <count>: Number of updates.
- -s: Display event counters and memory statistics.
iostat:
- -x: Extended statistics.
- -d: Display only device statistics.
- -p [device] : Display statistics for block devices and their partitions.
- <interval>: Update interval in seconds.
- <count>: Number of updates.
df:
- -h: Human-readable format.
- -T: Show file system type.
- -i: Show inode information.
- -a: Include pseudo, duplicate, inaccessible file systems.
du:
- -h: Human-readable format.
- -s: Summarize disk usage.
- -c: Grand total.
- -d <depth>: Limit directory depth.
free:
- -m: Megabytes.
- -g: Gigabytes.
- -h: Human-readable.
- -s <seconds>: Update interval.
- -c <count>: Number of updates.
netstat / ss:
- -l: Listening sockets.
- -t: TCP sockets.
- -u: UDP sockets.
- -n: Numeric addresses (don’t resolve hostnames).
- -p: Show process name and PID.
- -a: All sockets.
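- -i: With netstat, show the network interfaces table. With ss, show internal TCP information instead.
- -r: With netstat, show the routing table. With ss, resolve numeric addresses and ports to names instead.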
tcpdump:
- -i <interface>: Specify the interface to listen on.
- -n: Numeric addresses (don’t resolve hostnames).
- -nn: Don’t resolve hostnames or port names.
- -v: Verbose output.
- -vv: More verbose output.
- -w <file>: Write packets to a file.
- -r <file>: Read packets from a file.
- -c <count>: Stop after capturing <count> packets.
ping:
- -c <count>: Number of pings.
- -i <interval>: Interval between pings.
- -s <size>: Packet size.
- -t <ttl>: Time to live.
traceroute:
- -m <max_hops>: Maximum hops.
- -n: Numeric addresses (don’t resolve hostnames).
sar:
- -u: CPU utilization.
- -r: Memory utilization.
- -d: Disk utilization.
- -n DEV: Network device statistics.
- -P ALL: Per-processor statistics.
- -f <file>: Read data from a file.
journalctl:
- -xe: Jump to the end of the journal (-e) and add explanatory catalog text where available (-x).
- -f: Follow the log.
- -u <unit>: Show logs for a specific unit (e.g., nginx.service).
- --since <date>: Show logs since a specific date/time.
- --until <date>: Show logs until a specific date/time.
- -k: Show kernel messages.
- -b: Show logs from the current boot.
- -n <lines>: Show the last <lines> lines of the log.
tail:
- -f: Follow the file.
- -n <lines>: Show the last <lines> lines of the file.
- -n +<lines>: Begin output at line number <lines>.
grep:
- -i: Ignore case.
- -v: Invert match (show lines that don’t match).
- -r: Recursive search.
- -n: Show line numbers.
- -c: Count the number of matching lines.
- -l: List file names containing matches.
- -w: Match whole words only.
- -A <num>: Print <num> lines of context after each matching line.
- -B <num>: Print <num> lines of context before each matching line.
- -C <num>: Print <num> lines of context before and after each matching line.
awk:
- -F <delimiter>: Specify the field delimiter.
- -v var=value: Assign a value to a variable.
- -f <file>: Read awk commands from a file.
sed:
- -i: Edit the file in place. WARNING: Use with caution, as this modifies the original file.
- -n: Suppress automatic printing of pattern space.
- s/pattern/replacement/g (command, not an option): Substitute every match of pattern with replacement.
- /pattern/d (command, not an option): Delete lines matching the pattern.
watch:
- -n <seconds>: Interval between updates.
- -d: Highlight the differences between successive updates.

5. Advanced Usage

This section covers more complex examples and combinations of commands.

Combining tail, grep, and awk for real-time log analysis:
```
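tail -f /var/log/nginx/error.log | grep --line-buffered "error" | awk '{print $1, $3, $7}'
```
This command follows the Nginx error log, keeps lines containing “error”, and prints the first, third, and seventh whitespace-separated fields with awk. In the default Nginx error log format the first and third fields are the date and the log level; adjust the field numbers to pull out the parts of the message you need. The --line-buffered option makes grep pass each match on immediately instead of buffering its output when writing to a pipe.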
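
For a summary rather than a live stream, the same tools can tally entries per severity. The following is a minimal sketch, assuming the default Nginx error log layout in which the third field is a bracketed level such as [error]. The name count_levels is a hypothetical helper.

```shell
# count_levels (hypothetical helper): counts log lines per severity level.
# Reads the files given as arguments, or stdin when none are given.
# Assumes the third whitespace-separated field is a bracketed level,
# as in the default Nginx error log format.
count_levels() {
  awk '$3 ~ /^\[[a-z]+\]$/ { n[$3]++ }
       END { for (level in n) print n[level], level }' "$@" | sort -rn
}
```

Running count_levels /var/log/nginx/error.log prints one line per level, most frequent first, which is convenient input for a threshold check.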

Monitoring disk space usage and sending an email alert:

```
#!/bin/bash
THRESHOLD=90 # Disk usage threshold in percent
# -P keeps each file system on one line, so field 5 is always the usage column
USAGE=$(df -P / | awk 'NR==2 {print $5}' | tr -d '%')

if [ "$USAGE" -gt "$THRESHOLD" ]; then
  echo "Disk space on / is above $THRESHOLD%: $USAGE%" | mail -s "Disk Space Alert" admin@example.com
fi
```

This script checks disk usage on / and sends an email alert if it exceeds 90%. Save this script (e.g., disk_check.sh), make it executable (chmod +x disk_check.sh), and schedule it with cron (e.g., every 5 minutes).
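
The script above looks only at /. The following sketch extends the same idea to every mounted file system. It reads df -P style output on stdin, so it can be tried on saved output as well as live data. The name check_disk_usage is a hypothetical helper, and mount points containing spaces are not handled.

```shell
# check_disk_usage (hypothetical helper): prints "mountpoint usage%" for
# every file system whose usage is above the threshold (default 90).
# Expects `df -P` output on stdin: one line per file system, usage in
# field 5 and mount point in field 6.
check_disk_usage() {
  threshold="${1:-90}"
  awk -v limit="$threshold" '
    NR > 1 {                        # skip the header row
      pct = $5; sub(/%/, "", pct)   # "43%" becomes "43"
      if (pct + 0 > limit) print $6, pct "%"
    }'
}
```

Typical use would be df -P | check_disk_usage 90, with any output piped to mail as in the script above. For the 5-minute schedule mentioned earlier, a crontab entry would use the schedule */5 * * * * followed by the full path to the saved script.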

Using sar to identify performance bottlenecks:
```
sar -u -d 1 10  # CPU and disk utilization every 1 second for 10 seconds
```
Analyze the output to identify high CPU usage, disk I/O bottlenecks, or other performance issues.
Creating a custom monitoring dashboard with watch:
```
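# free: the "Mem:" row holds the RAM figures (the last row is Swap); iostat: skip the two banner lines
watch -n 5 'echo "CPU Usage:"; sar -u 1 1 | tail -n 1; echo "Memory Usage:"; free -m | grep "^Mem:"; echo "Disk I/O:"; iostat -dx | tail -n +3'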
```
This command creates a simple dashboard that displays CPU usage, memory usage, and disk I/O statistics every 5 seconds. This is a very basic example, and more sophisticated dashboards can be created using tools like Grafana.
Using journalctl to debug service startup issues:
```
journalctl -u myapp.service -b  # Show logs for myapp.service from the current boot
```
This command is useful for troubleshooting issues that occur during service startup.
Monitoring network traffic with tcpdump and analyzing with tshark (Wireshark CLI):
```
tcpdump -i eth0 -w capture.pcap  # Capture traffic to a file

# (After capturing traffic)
tshark -r capture.pcap -T fields -e ip.src -e ip.dst -e tcp.port -e frame.len  # Extract specific fields
```
This captures network traffic and then uses tshark to extract relevant data like source and destination IPs, ports, and frame length. This is helpful for network troubleshooting and security analysis.

6. Tips & Tricks

Use aliases for frequently used commands: Add aliases to your ~/.bashrc or ~/.zshrc file. For example:
```
alias dfh='df -h'
alias topm='top -o %MEM'  # Sort by memory
alias tailf='tail -f'
```
Combine commands with pipes for powerful filtering and analysis: As demonstrated in the examples above, piping commands together allows for complex data processing.
Use nohup to run monitoring commands in the background:
```
nohup tail -f /var/log/myapp.log > myapp.log.out 2>&1 &
```
This command runs tail -f in the background, redirecting output to myapp.log.out. The 2>&1 redirects standard error to standard output.
Learn regular expressions for more powerful grep, awk, and sed usage: Regular expressions are essential for advanced text processing.
Consider using specialized monitoring tools like Prometheus and Grafana for production environments: These tools provide more advanced features like data visualization, alerting, and historical data analysis. They require setup and configuration but are well worth the effort for complex environments.
Use screen or tmux for persistent terminal sessions: This allows you to keep monitoring commands running even if your SSH connection is interrupted.

7. Troubleshooting

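top or htop not showing all processes: Both tools normally list every process. If you see only your own, /proc may be mounted with the hidepid option; run the tool with sudo or check the mount options.
df or du showing unexpected disk usage: Ensure the file system is mounted correctly. If df reports more used space than du, look for deleted files that a process still holds open (lsof +L1 lists them) or for files hidden beneath a mount point.
netstat not working: netstat belongs to the deprecated net-tools package, which many current distributions no longer install by default. Use ss instead.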
tcpdump capturing no traffic: Double-check the interface name and filter expression. Ensure you have permissions to capture traffic.
journalctl showing no logs: Ensure the systemd journal is running and configured correctly. Check the /etc/systemd/journald.conf file.
sensors not working: Make sure you have the lm-sensors package installed and have run sudo sensors-detect.
Email alerts not being sent: Verify that your system is configured to send email and that the recipient address is valid. Check your mail server logs for errors.

8. Related Commands

ps: Process status. Provides a snapshot of current processes.
kill: Sends a signal to a process. Used to terminate or control processes.
killall: Kills processes by name.
systemctl: Controls systemd services.
crontab: Manages cron jobs (scheduled tasks).
lsof: List open files. Useful for identifying which processes are using specific files or network ports.
strace: Trace system calls made by a process. Useful for debugging.
dmesg: Display kernel messages.
iptables / nftables: Firewall configuration.

This cheat sheet provides a solid foundation for monitoring and alerting on Linux systems. Remember to adapt these commands and techniques to your specific needs and environment. Always test thoroughly before implementing changes in a production environment.