
Advanced Text Processing and Regular Expressions

Category: Advanced Linux Administration
Type: Linux Commands
Generated on: 2025-07-10 03:16:05
For: System Administration, Development & Technical Interviews


Advanced Text Processing and Regular Expressions (Linux Commands) - Cheatsheet

This cheatsheet provides a comprehensive guide to advanced text processing and regular expressions using common Linux commands, tailored for system administrators and developers.

| Command | Description | When to Use |
| ------- | ----------- | ----------- |
| grep | Global Regular Expression Print. Finds lines matching a pattern. | Searching logs, configuration files, code, or any text-based data for specific information. |
| sed | Stream EDitor. Performs text transformations on a stream or file. | Replacing text, deleting lines, inserting content, or performing more complex text manipulations. |
| awk | Pattern scanning and processing language. Processes data based on fields and records. | Parsing structured data (e.g., CSV, log files), performing calculations, and generating reports. |
| tr | Translate or delete characters. Replaces or removes characters in a stream. | Converting case, removing control characters, and performing simple text transformations. |
| cut | Remove sections from each line of files. Extracts specific columns from delimited data. | Extracting data from CSV files, log files, or any data where fields are separated by delimiters. |
| paste | Merge lines of files. Joins lines from multiple files side-by-side. | Combining data from different files based on line number or a common field. |
| join | Join lines of two files on a common field. Combines data from two files based on a specified key field. | Merging data from different files based on a common identifier (e.g., user ID, product ID). |
Basic syntax:

grep [OPTIONS] PATTERN [FILE...]
sed [OPTIONS] 'COMMAND' [FILE...]
awk [OPTIONS] 'PATTERN { ACTION }' [FILE...]
tr [OPTIONS] SET1 [SET2]
cut [OPTIONS] [FILE...]
paste [OPTIONS] [FILE...]
join [OPTIONS] FILE1 FILE2
  • Search for a specific string in a file:

    grep "error" /var/log/syslog

    Output: (Lines from /var/log/syslog containing “error”)

  • Search for a string case-insensitively:

    grep -i "error" /var/log/syslog

    Output: (Lines from /var/log/syslog containing “error”, “Error”, “ERROR”, etc.)

  • Search recursively in a directory:

    grep -r "password" /etc/

    Output: (File paths and lines containing “password” within the /etc/ directory and its subdirectories)
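The options above can be combined. A runnable sketch using a made-up log file (app.log and its contents are invented for illustration):

```shell
# Hypothetical log file for this sketch.
printf 'INFO start\nError: disk full\nerror: retry\nINFO done\n' > app.log

# -i ignores case, -n prefixes matching line numbers.
grep -in "error" app.log
# 2:Error: disk full
# 3:error: retry

# -c counts matching lines instead of printing them.
grep -ic "error" app.log
# 2
rm app.log
```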

  • Replace the first occurrence of a string in a file:

    sed 's/old_string/new_string/' input.txt

    Output: (Modified content of input.txt printed to standard output)

  • Replace all occurrences of a string in a file:

    sed 's/old_string/new_string/g' input.txt

    Output: (Modified content of input.txt printed to standard output)

  • Replace all occurrences and write changes to the file (in-place):

    sed -i 's/old_string/new_string/g' input.txt

    WARNING: This modifies the file directly. Consider creating a backup first. cp input.txt input.txt.bak

  • Delete lines containing a specific string:

    sed '/error/d' /var/log/syslog

    Output: (Lines from /var/log/syslog excluding those containing “error”)
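A safe in-place-edit workflow, sketched with a made-up conf.txt (backup first, as warned above; note the -i syntax shown is GNU sed, as found on Linux):

```shell
# Hypothetical config file for this sketch.
printf 'old value\nkeep this line\nold again\n' > conf.txt
cp conf.txt conf.txt.bak                 # backup before any in-place edit
sed -i 's/old/new/g' conf.txt            # GNU sed; BSD/macOS sed needs: sed -i '' ...
cat conf.txt
# new value
# keep this line
# new again
rm conf.txt conf.txt.bak
```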

  • Print the first column of a CSV file:

    awk -F',' '{print $1}' data.csv

    Output: (The first column of each row in data.csv, separated by newlines)

  • Print lines where the third column is greater than 10:

    awk '$3 > 10 {print}' data.txt

    Output: (Lines from data.txt where the third field is greater than 10)

  • Calculate the sum of the second column:

    awk '{sum += $2} END {print sum}' data.txt

    Output: (The sum of the values in the second column of data.txt)
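The sum pattern above generalizes to other aggregates; a sketch with an invented fruit.csv:

```shell
# Hypothetical CSV: item,quantity.
printf 'apples,3\npears,5\nplums,2\n' > fruit.csv

# Sum the second field; the END block runs once, after the last line.
awk -F',' '{sum += $2} END {print "total:", sum}' fruit.csv
# total: 10

# Average via NR, the number of records read.
awk -F',' '{sum += $2} END {print "avg:", sum / NR}' fruit.csv
rm fruit.csv
```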

  • Convert lowercase to uppercase:

    echo "hello world" | tr '[:lower:]' '[:upper:]'

    Output: HELLO WORLD

  • Delete specific characters:

    echo "hello world!" | tr -d '!'

    Output: hello world

  • Squeeze repeating characters:

    echo "hello   world" | tr -s ' '

    Output: hello world (the run of spaces is collapsed into one)
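POSIX character classes combine well with these flags; a short sketch (input strings are made up):

```shell
# Uppercase via character classes, then squeeze runs of spaces.
echo "mixed   Case    text" | tr '[:lower:]' '[:upper:]' | tr -s ' '
# MIXED CASE TEXT

# -c complements SET1, so -cd deletes everything that is NOT a digit
# (including the trailing newline, hence the extra echo).
echo "abc123def" | tr -cd '[:digit:]'; echo
# 123
```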

  • Extract the first field from a comma-separated file:

    cut -d ',' -f 1 data.csv

    Output: (The first field of each line in data.csv)

  • Extract fields 1 and 3 from a tab-separated file:

    cut -f 1,3 --output-delimiter='|' data.tsv

    Output: (Fields 1 and 3 of each line in data.tsv, separated by |)

  • Extract characters 1 to 5 from each line:

    cut -c 1-5 data.txt

    Output: (The first 5 characters of each line in data.txt)
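A runnable field-extraction sketch with an invented users.csv (note that --output-delimiter is GNU coreutils; BSD cut lacks it):

```shell
# Hypothetical CSV of user records: name,role,status.
printf 'alice,admin,active\nbob,dev,inactive\n' > users.csv
cut -d ',' -f 1,3 --output-delimiter=' ' users.csv
# alice active
# bob inactive
rm users.csv
```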

  • Paste two files side by side, separated by a tab:

    paste file1.txt file2.txt

    Output: (Lines from file1.txt and file2.txt merged side-by-side, separated by a tab)

  • Paste two files side by side, separated by a comma:

    paste -d ',' file1.txt file2.txt

    Output: (Lines from file1.txt and file2.txt merged side-by-side, separated by a comma)
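A self-contained sketch with two made-up column files, showing both the column merge and the -s (serial) mode:

```shell
# Two hypothetical column files, one value per line.
printf 'alice\nbob\n' > names.txt
printf '30\n25\n' > ages.txt
paste -d ',' names.txt ages.txt
# alice,30
# bob,25

# -s pastes serially: all lines of a file onto a single line.
paste -s -d ',' names.txt
# alice,bob
rm names.txt ages.txt
```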

  • Join two files based on the first field:

    join file1.txt file2.txt

    Output: (Lines from both files joined on matching values in the first field. Both files must be sorted on that field; sort them first if necessary.)

  • Join two files based on specific fields (File1 field 2, File2 field 1):

    join -1 2 -2 1 file1.txt file2.txt

    Output: (Lines from both files joined based on matching values, where the key is the second field in file1.txt and the first field in file2.txt)

  • Join two files showing unmatched lines from file1 (-a 1):

    join -a 1 file1.txt file2.txt

    Output: (All paired lines, plus unpairable lines from file1 printed as-is. To fill in the missing fields with a placeholder, combine -a with -e and an -o format.)
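Since join silently drops lines when inputs are unsorted, a fully runnable sketch helps; file names and contents here are invented:

```shell
# Hypothetical key-value files; join needs both inputs sorted on the key.
printf '2 bob\n1 alice\n' | sort > ids.txt
printf '1 admin\n3 guest\n' | sort > roles.txt
join ids.txt roles.txt
# 1 alice admin

# -a 1 keeps unpairable lines from file 1; -e with -o fills the missing field.
join -a 1 -e 'NONE' -o '0,1.2,2.2' ids.txt roles.txt
# 1 alice admin
# 2 bob NONE
rm ids.txt roles.txt
```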

grep options:
  • -i: Case-insensitive search.
  • -v: Invert match (show lines that do not match).
  • -r or -R: Recursive search.
  • -n: Show line numbers.
  • -c: Count the number of matching lines.
  • -l: List only the files containing matches.
  • -w: Match whole words only.
  • -E: Interpret PATTERN as an extended regular expression.
sed options and commands:
  • -i: Edit the file in-place. WARNING: Use with caution!
  • -n: Suppress default output (useful with p command).
  • -e: Allow multiple commands (e.g., sed -e 's/a/b/' -e 's/c/d/' file.txt).
  • s/PATTERN/REPLACEMENT/: Substitute (replace) the first occurrence of PATTERN with REPLACEMENT.
  • s/PATTERN/REPLACEMENT/g: Substitute (replace) all occurrences of PATTERN with REPLACEMENT.
  • /PATTERN/d: Delete lines matching PATTERN.
  • /PATTERN/p: Print lines matching PATTERN (useful with -n).
  • i\: Insert text before a line.
  • a\: Append text after a line.
awk options and built-ins:
  • -F: Specify the field separator (defaults to whitespace; e.g., -F',' for CSV files).
  • BEGIN { ACTION }: Execute ACTION before processing any lines.
  • END { ACTION }: Execute ACTION after processing all lines.
  • $0: The entire current line.
  • $1, $2, …: The first, second, etc., fields.
  • NF: Number of fields in the current record.
  • NR: Current record (line) number.
tr options:
  • -d: Delete characters in SET1.
  • -s: Squeeze repeating characters in SET1.
  • -c: Complement SET1 (use characters not in SET1).
cut options:
  • -d: Delimiter (character separating fields). Defaults to tab.
  • -f: Field list (comma-separated list of field numbers).
  • -c: Character list (comma-separated list of character positions).
  • --output-delimiter: Specify a different output delimiter.
paste options:
  • -d: Delimiter to use between pasted files. Default is tab.
  • -s: Paste all lines from one file into a single line, separated by the delimiter.
join options:
  • -1 FIELD: Field to use as the join key in the first file.
  • -2 FIELD: Field to use as the join key in the second file.
  • -a FILE_NUMBER: Show unpairable lines from the specified file (1 or 2).
  • -e STRING: Replace missing input fields with STRING.
  • -t CHAR: Use CHAR as the field separator.
  • Find IP addresses in a file:

    grep -E '([0-9]{1,3}\.){3}[0-9]{1,3}' /var/log/nginx/access.log

    Output: (Lines containing IP addresses from the access log)

  • Find lines starting with “ERROR” or “WARNING”:

    grep -E '^(ERROR|WARNING)' /var/log/syslog

    Output: (Lines starting with ERROR or WARNING)
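Adding -o prints only the matched text rather than whole lines, which is handy for harvesting values; a sketch with an invented access.log:

```shell
# -o prints one match per line; the log content is made up.
printf 'ok from 10.0.0.5\nno address here\nhit 192.168.1.77 again\n' > access.log
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' access.log
# 10.0.0.5
# 192.168.1.77
# Caveat: this pattern also accepts impossible octets such as 999.999.0.1.
rm access.log
```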

  • Swap the first two words on each line:

    sed -E 's/^([^ ]*) ([^ ]*)/\2 \1/' input.txt

    Output: (Modified lines with the first two words swapped)
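To make the backreference behavior concrete, the same swap run on a made-up input line:

```shell
# \1 and \2 refer back to the parenthesised groups; [^ ]* means "run of non-spaces".
echo "hello world again" | sed -E 's/^([^ ]*) ([^ ]*)/\2 \1/'
# world hello again
```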

  • Calculate the average of the second column, then print lines where the third column exceeds it:

    awk 'NR==FNR {sum += $2; count++; next} FNR == 1 {avg = sum / count; print "Average:", avg} $3 > avg' data.txt data.txt

    Output: (The average of the second column, followed by the qualifying lines. The file is named twice so awk reads it in two passes: NR==FNR is true only during the first pass, and the average computed there is available while the second pass filters. A single pass cannot do this, because a value computed in END is not available while earlier lines stream by.)
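Because awk evaluates its rules per line, an average computed at the end of one pass is only usable during a second pass over the file. A runnable sketch with made-up data:

```shell
# Made-up data: label, value, score.
printf 'a 4 9\nb 6 2\nc 2 8\n' > data.txt

# Pass 1 (NR==FNR) totals column 2; pass 2 compares column 3 to the average.
awk 'NR==FNR {sum += $2; n++; next}
     FNR == 1 {avg = sum / n; print "Average:", avg}
     $3 > avg' data.txt data.txt
# Average: 4
# a 4 9
# c 2 8
rm data.txt
```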

  • Count the lines in /var/log/syslog that contain both “error” and “authentication”:

    grep "error" /var/log/syslog | grep "authentication" | wc -l

    Output: (The number of lines containing both “error” and “authentication”)

  • Extract usernames from /etc/passwd and sort them alphabetically:

    cut -d ':' -f 1 /etc/passwd | sort

    Output: (A sorted list of usernames from /etc/passwd)
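The same cut-and-sort idea extends to frequency reports by adding uniq -c; a sketch using an invented passwd-style file rather than the real /etc/passwd:

```shell
# Hypothetical passwd-style file (name:password:uid).
printf 'alice:x:1000\nbob:x:1001\nalice:x:1002\n' > pw.txt
cut -d ':' -f 1 pw.txt | sort | uniq -c | sort -rn
#   2 alice
#   1 bob    (uniq -c left-pads the counts)
rm pw.txt
```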

sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g' input.txt

This script replaces special HTML characters with their corresponding HTML entities. It demonstrates the power of chained sed commands.
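Order matters here: the & substitution must come first, or the ampersands introduced by later entities would themselves be escaped. A sketch on a made-up input (single-quote escaping is omitted to keep the shell quoting simple):

```shell
# Escape & first so the &lt;/&gt;/&quot; insertions are not double-escaped.
echo '<a href="x">Tom & Jerry</a>' |
  sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g'
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
```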

  • Use man pages: man grep, man sed, man awk, etc. for detailed documentation.
  • Test regular expressions: Use online regex testers (e.g., regex101.com) to validate your patterns.
  • Backup files before using sed -i: Prevent data loss.
  • Use variables in awk: Makes your scripts more readable and maintainable.
  • Quote your patterns: Especially if they contain special characters.
  • Performance: For large files, consider using awk or grep with optimized regular expressions for better performance. Avoid unnecessary piping.
  • tee for debugging: Use tee to save intermediate output to a file for debugging purposes. e.g., grep "pattern" input.txt | tee debug.txt | awk '{print $1}'
  • Use character classes: [:alnum:], [:alpha:], [:digit:], [:lower:], [:upper:], [:punct:], [:space:] for more robust pattern matching.
  • grep not finding matches:
    • Check for typos in the pattern.
    • Verify the case sensitivity. Use -i for case-insensitive search.
    • Ensure the file exists and is readable.
    • Check for newline characters or other hidden characters that might be interfering with the pattern.
  • sed -i corrupting the file:
    • Always create a backup before using -i.
    • Double-check the sed command syntax.
    • If possible, test the command on a copy of the file first.
  • awk not processing fields correctly:
    • Verify the field separator using -F.
    • Check for inconsistent field separators in the input data.
    • Make sure the fields you are referencing ($1, $2, etc.) actually exist.
  • join not working:
    • Ensure both files are sorted according to the join key. Use sort if necessary.
    • Verify that the join key fields are correctly specified using -1 and -2.
    • Check for leading or trailing whitespace in the join key fields.
  • Regular Expression errors:
    • Ensure your regular expression syntax is correct for the command you’re using (e.g., basic vs. extended).
    • Escape special characters properly (e.g., \., \*, \+).
  • sort: Sort lines of text files.
  • uniq: Report or omit repeated lines.
  • wc: Count lines, words, and characters.
  • head: Output the first part of files.
  • tail: Output the last part of files.
  • diff: Compare files line by line.
  • find: Search for files in a directory hierarchy.
  • xargs: Build and execute command lines from standard input.
  • iconv: Convert text from one character encoding to another.

This cheatsheet provides a solid foundation for advanced text processing and regular expressions in Linux. Remember to practice and experiment with these commands to become proficient in their use. Good luck!