
Advanced Text Processing and Regular Expressions

Category: Advanced Linux Administration
Type: Linux Commands
Generated on: 2025-07-10 03:16:05
For: System Administration, Development & Technical Interviews


Advanced Text Processing and Regular Expressions (Linux Commands) - Cheatsheet

This cheatsheet provides a comprehensive guide to advanced text processing and regular expressions using common Linux commands, tailored for system administrators and developers.

| Command | Description | When to Use |
| ------- | ----------- | ----------- |
| grep | Global Regular Expression Print. Finds lines matching a pattern. | Searching logs, configuration files, code, or any text-based data for specific information. |
| sed | Stream EDitor. Performs text transformations on a stream or file. | Replacing text, deleting lines, inserting content, or performing more complex text manipulations. |
| awk | Pattern scanning and processing language. Processes data based on fields and records. | Parsing structured data (e.g., CSV, log files), performing calculations, and generating reports. |
| tr | Translate or delete characters. Replaces or removes characters in a stream. | Converting case, removing control characters, and performing simple text transformations. |
| cut | Remove sections from each line of files. Extracts specific columns from delimited data. | Extracting data from CSV files, log files, or any data where fields are separated by delimiters. |
| paste | Merge lines of files. Joins lines from multiple files side-by-side. | Combining data from different files based on line number or a common field. |
| join | Join lines of two files on a common field. Combines data from two files based on a specified key field. | Merging data from different files based on a common identifier (e.g., user ID, product ID). |
Basic syntax:

grep [OPTIONS] PATTERN [FILE...]
sed [OPTIONS] 'COMMAND' [FILE...]
awk [OPTIONS] 'PATTERN { ACTION }' [FILE...]
tr [OPTIONS] SET1 [SET2]
cut [OPTIONS] [FILE...]
paste [OPTIONS] [FILE...]
join [OPTIONS] FILE1 FILE2
  • Search for a specific string in a file:

    grep "error" /var/log/syslog

    Output: (Lines from /var/log/syslog containing “error”)

  • Search for a string case-insensitively:

    grep -i "error" /var/log/syslog

    Output: (Lines from /var/log/syslog containing “error”, “Error”, “ERROR”, etc.)

  • Search recursively in a directory:

    grep -r "password" /etc/

    Output: (File paths and lines containing “password” within the /etc/ directory and its subdirectories)
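The options above can be combined. A runnable sketch using a made-up log file (app.log and its contents are invented for illustration):

```shell
# Hypothetical log file for this sketch.
printf 'INFO start\nError: disk full\nerror: retry\nINFO done\n' > app.log

# -i ignores case, -n prefixes matching line numbers.
grep -in "error" app.log
# 2:Error: disk full
# 3:error: retry

# -c counts matching lines instead of printing them.
grep -ic "error" app.log
# 2
rm app.log
```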

  • Replace the first occurrence of a string in a file:

    sed 's/old_string/new_string/' input.txt

    Output: (Modified content of input.txt printed to standard output)

  • Replace all occurrences of a string in a file:

    sed 's/old_string/new_string/g' input.txt

    Output: (Modified content of input.txt printed to standard output)

  • Replace all occurrences and write changes to the file (in-place):

    sed -i 's/old_string/new_string/g' input.txt

    WARNING: This modifies the file directly. Consider creating a backup first. cp input.txt input.txt.bak

  • Delete lines containing a specific string:

    sed '/error/d' /var/log/syslog

    Output: (Lines from /var/log/syslog excluding those containing “error”)
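A safe in-place-edit workflow, sketched with a made-up conf.txt (backup first, as warned above; note the -i syntax shown is GNU sed, as found on Linux):

```shell
# Hypothetical config file for this sketch.
printf 'old value\nkeep this line\nold again\n' > conf.txt
cp conf.txt conf.txt.bak                 # backup before any in-place edit
sed -i 's/old/new/g' conf.txt            # GNU sed; BSD/macOS sed needs: sed -i '' ...
cat conf.txt
# new value
# keep this line
# new again
rm conf.txt conf.txt.bak
```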

  • Print the first column of a CSV file:

    awk -F',' '{print $1}' data.csv

    Output: (The first column of each row in data.csv, separated by newlines)

  • Print lines where the third column is greater than 10:

    awk '$3 > 10 {print}' data.txt

    Output: (Lines from data.txt where the third field is greater than 10)

  • Calculate the sum of the second column:

    awk '{sum += $2} END {print sum}' data.txt

    Output: (The sum of the values in the second column of data.txt)
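The sum pattern above generalizes to other aggregates; a sketch with an invented fruit.csv:

```shell
# Hypothetical CSV: item,quantity.
printf 'apples,3\npears,5\nplums,2\n' > fruit.csv

# Sum the second field; the END block runs once, after the last line.
awk -F',' '{sum += $2} END {print "total:", sum}' fruit.csv
# total: 10

# Average via NR, the number of records read.
awk -F',' '{sum += $2} END {print "avg:", sum / NR}' fruit.csv
rm fruit.csv
```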

  • Convert lowercase to uppercase:

    echo "hello world" | tr '[:lower:]' '[:upper:]'

    Output: HELLO WORLD

  • Delete specific characters:

    echo "hello world!" | tr -d '!'

    Output: hello world

  • Squeeze repeating characters:

    echo "hello   world" | tr -s ' '

    Output: hello world (the run of spaces is collapsed into one)
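POSIX character classes combine well with these flags; a short sketch (input strings are made up):

```shell
# Uppercase via character classes, then squeeze runs of spaces.
echo "mixed   Case    text" | tr '[:lower:]' '[:upper:]' | tr -s ' '
# MIXED CASE TEXT

# -c complements SET1, so -cd deletes everything that is NOT a digit
# (including the trailing newline, hence the extra echo).
echo "abc123def" | tr -cd '[:digit:]'; echo
# 123
```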

  • Extract the first field from a comma-separated file:

    cut -d ',' -f 1 data.csv

    Output: (The first field of each line in data.csv)

  • Extract fields 1 and 3 from a tab-separated file:

    cut -f 1,3 --output-delimiter='|' data.tsv

    Output: (Fields 1 and 3 of each line in data.tsv, separated by |)

  • Extract characters 1 to 5 from each line:

    cut -c 1-5 data.txt

    Output: (The first 5 characters of each line in data.txt)
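A runnable field-extraction sketch with an invented users.csv (note that --output-delimiter is GNU coreutils; BSD cut lacks it):

```shell
# Hypothetical CSV of user records: name,role,status.
printf 'alice,admin,active\nbob,dev,inactive\n' > users.csv
cut -d ',' -f 1,3 --output-delimiter=' ' users.csv
# alice active
# bob inactive
rm users.csv
```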

  • Paste two files side by side, separated by a tab:

    paste file1.txt file2.txt

    Output: (Lines from file1.txt and file2.txt merged side-by-side, separated by a tab)

  • Paste two files side by side, separated by a comma:

    paste -d ',' file1.txt file2.txt

    Output: (Lines from file1.txt and file2.txt merged side-by-side, separated by a comma)
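A self-contained sketch with two made-up column files, showing both the column merge and the -s (serial) mode:

```shell
# Two hypothetical column files, one value per line.
printf 'alice\nbob\n' > names.txt
printf '30\n25\n' > ages.txt
paste -d ',' names.txt ages.txt
# alice,30
# bob,25

# -s pastes serially: all lines of a file onto a single line.
paste -s -d ',' names.txt
# alice,bob
rm names.txt ages.txt
```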

  • Join two files based on the first field:

    join file1.txt file2.txt

    Output: (Lines from both files joined on matching values in the first field. Both files must be sorted on that field; sort them first if necessary.)

  • Join two files based on specific fields (File1 field 2, File2 field 1):

    join -1 2 -2 1 file1.txt file2.txt

    Output: (Lines from both files joined based on matching values, where the key is the second field in file1.txt and the first field in file2.txt)

  • Join two files showing unmatched lines from file1 (-a 1):

    join -a 1 file1.txt file2.txt

    Output: (All paired lines, plus unpairable lines from file1 printed as-is. To fill in the missing fields with a placeholder, combine -a with -e and an -o format.)
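Since join silently drops lines when inputs are unsorted, a fully runnable sketch helps; file names and contents here are invented:

```shell
# Hypothetical key-value files; join needs both inputs sorted on the key.
printf '2 bob\n1 alice\n' | sort > ids.txt
printf '1 admin\n3 guest\n' | sort > roles.txt
join ids.txt roles.txt
# 1 alice admin

# -a 1 keeps unpairable lines from file 1; -e with -o fills the missing field.
join -a 1 -e 'NONE' -o '0,1.2,2.2' ids.txt roles.txt
# 1 alice admin
# 2 bob NONE
rm ids.txt roles.txt
```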

grep options:
  • -i: Case-insensitive search.
  • -v: Invert match (show lines that do not match).
  • -r or -R: Recursive search.
  • -n: Show line numbers.
  • -c: Count the number of matching lines.
  • -l: List only the files containing matches.
  • -w: Match whole words only.
  • -E: Interpret PATTERN as an extended regular expression.
sed options and commands:
  • -i: Edit the file in-place. WARNING: Use with caution!
  • -n: Suppress default output (useful with p command).
  • -e: Allow multiple commands (e.g., sed -e 's/a/b/' -e 's/c/d/' file.txt).
  • s/PATTERN/REPLACEMENT/: Substitute (replace) the first occurrence of PATTERN with REPLACEMENT.
  • s/PATTERN/REPLACEMENT/g: Substitute (replace) all occurrences of PATTERN with REPLACEMENT.
  • /PATTERN/d: Delete lines matching PATTERN.
  • /PATTERN/p: Print lines matching PATTERN (useful with -n).
  • i\: Insert text before a line.
  • a\: Append text after a line.
awk options and built-ins:
  • -F: Specify the field separator (defaults to whitespace; e.g., -F',' for CSV files).
  • BEGIN { ACTION }: Execute ACTION before processing any lines.
  • END { ACTION }: Execute ACTION after processing all lines.
  • $0: The entire current line.
  • $1, $2, …: The first, second, etc., fields.
  • NF: Number of fields in the current record.
  • NR: Current record (line) number.
tr options:
  • -d: Delete characters in SET1.
  • -s: Squeeze repeating characters in SET1.
  • -c: Complement SET1 (use characters not in SET1).
cut options:
  • -d: Delimiter (character separating fields). Defaults to tab.
  • -f: Field list (comma-separated list of field numbers).
  • -c: Character list (comma-separated list of character positions).
  • --output-delimiter: Specify a different output delimiter.
paste options:
  • -d: Delimiter to use between pasted files. Default is tab.
  • -s: Paste all lines from one file into a single line, separated by the delimiter.
join options:
  • -1 FIELD: Field to use as the join key in the first file.
  • -2 FIELD: Field to use as the join key in the second file.
  • -a FILE_NUMBER: Show unpairable lines from the specified file (1 or 2).
  • -e STRING: Replace missing input fields with STRING.
  • -t CHAR: Use CHAR as the field separator.
  • Find IP addresses in a file:

    grep -E '([0-9]{1,3}\.){3}[0-9]{1,3}' /var/log/nginx/access.log

    Output: (Lines containing IP addresses from the access log)

  • Find lines starting with “ERROR” or “WARNING”:

    grep -E '^(ERROR|WARNING)' /var/log/syslog

    Output: (Lines starting with ERROR or WARNING)
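Adding -o prints only the matched text rather than whole lines, which is handy for harvesting values; a sketch with an invented access.log:

```shell
# -o prints one match per line; the log content is made up.
printf 'ok from 10.0.0.5\nno address here\nhit 192.168.1.77 again\n' > access.log
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' access.log
# 10.0.0.5
# 192.168.1.77
# Caveat: this pattern also accepts impossible octets such as 999.999.0.1.
rm access.log
```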

  • Swap the first two words on each line:

    sed -E 's/^([^ ]*) ([^ ]*)/\2 \1/' input.txt

    Output: (Modified lines with the first two words swapped)
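To make the backreference behavior concrete, the same swap run on a made-up input line:

```shell
# \1 and \2 refer back to the parenthesised groups; [^ ]* means "run of non-spaces".
echo "hello world again" | sed -E 's/^([^ ]*) ([^ ]*)/\2 \1/'
# world hello again
```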

  • Calculate the average of the second column, then print lines where the third column exceeds it:

    awk 'NR==FNR {sum += $2; count++; next} FNR == 1 {avg = sum / count; print "Average:", avg} $3 > avg' data.txt data.txt

    Output: (The average of the second column, followed by the qualifying lines. The file is named twice so awk reads it in two passes: NR==FNR is true only during the first pass, and the average computed there is available while the second pass filters. A single pass cannot do this, because a value computed in END is not available while earlier lines stream by.)
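Because awk evaluates its rules per line, an average computed at the end of one pass is only usable during a second pass over the file. A runnable sketch with made-up data:

```shell
# Made-up data: label, value, score.
printf 'a 4 9\nb 6 2\nc 2 8\n' > data.txt

# Pass 1 (NR==FNR) totals column 2; pass 2 compares column 3 to the average.
awk 'NR==FNR {sum += $2; n++; next}
     FNR == 1 {avg = sum / n; print "Average:", avg}
     $3 > avg' data.txt data.txt
# Average: 4
# a 4 9
# c 2 8
rm data.txt
```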

  • Count the lines in /var/log/syslog that contain both “error” and “authentication”:

    grep "error" /var/log/syslog | grep "authentication" | wc -l

    Output: (The number of lines containing both “error” and “authentication”)

  • Extract usernames from /etc/passwd and sort them alphabetically:

    cut -d ':' -f 1 /etc/passwd | sort

    Output: (A sorted list of usernames from /etc/passwd)
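The same cut-and-sort idea extends to frequency reports by adding uniq -c; a sketch using an invented passwd-style file rather than the real /etc/passwd:

```shell
# Hypothetical passwd-style file (name:password:uid).
printf 'alice:x:1000\nbob:x:1001\nalice:x:1002\n' > pw.txt
cut -d ':' -f 1 pw.txt | sort | uniq -c | sort -rn
#   2 alice
#   1 bob    (uniq -c left-pads the counts)
rm pw.txt
```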

sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g' input.txt

This script replaces special HTML characters with their corresponding HTML entities. It demonstrates the power of chained sed commands.
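Order matters here: the & substitution must come first, or the ampersands introduced by later entities would themselves be escaped. A sketch on a made-up input (single-quote escaping is omitted to keep the shell quoting simple):

```shell
# Escape & first so the &lt;/&gt;/&quot; insertions are not double-escaped.
echo '<a href="x">Tom & Jerry</a>' |
  sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g'
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
```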

  • Use man pages: man grep, man sed, man awk, etc. for detailed documentation.
  • Test regular expressions: Use online regex testers (e.g., regex101.com) to validate your patterns.
  • Backup files before using sed -i: Prevent data loss.
  • Use variables in awk: Makes your scripts more readable and maintainable.
  • Quote your patterns: Especially if they contain special characters.
  • Performance: For large files, consider using awk or grep with optimized regular expressions for better performance. Avoid unnecessary piping.
  • tee for debugging: Use tee to save intermediate output to a file for debugging purposes. e.g., grep "pattern" input.txt | tee debug.txt | awk '{print $1}'
  • Use character classes: [:alnum:], [:alpha:], [:digit:], [:lower:], [:upper:], [:punct:], [:space:] for more robust pattern matching.
  • grep not finding matches:
    • Check for typos in the pattern.
    • Verify the case sensitivity. Use -i for case-insensitive search.
    • Ensure the file exists and is readable.
    • Check for newline characters or other hidden characters that might be interfering with the pattern.
  • sed -i corrupting the file:
    • Always create a backup before using -i.
    • Double-check the sed command syntax.
    • If possible, test the command on a copy of the file first.
  • awk not processing fields correctly:
    • Verify the field separator using -F.
    • Check for inconsistent field separators in the input data.
    • Make sure the fields you are referencing ($1, $2, etc.) actually exist.
  • join not working:
    • Ensure both files are sorted according to the join key. Use sort if necessary.
    • Verify that the join key fields are correctly specified using -1 and -2.
    • Check for leading or trailing whitespace in the join key fields.
  • Regular Expression errors:
    • Ensure your regular expression syntax is correct for the command you’re using (e.g., basic vs. extended).
    • Escape special characters properly (e.g., \., \*, \+).
  • sort: Sort lines of text files.
  • uniq: Report or omit repeated lines.
  • wc: Count lines, words, and characters.
  • head: Output the first part of files.
  • tail: Output the last part of files.
  • diff: Compare files line by line.
  • find: Search for files in a directory hierarchy.
  • xargs: Build and execute command lines from standard input.
  • iconv: Convert text from one character encoding to another.

This cheatsheet provides a solid foundation for advanced text processing and regular expressions in Linux. Remember to practice and experiment with these commands to become proficient in their use. Good luck!