Suffix Array

suffix-array: The Ultimate Cheatsheet

1. Detailed Explanation

What is a Suffix Array? A suffix array is a sorted array of all suffixes of a given string. More formally, for a string S of length n, the suffix array SA is an array of integers of length n such that SA[i] is the starting position of the i-th smallest suffix of S.
Why is it important and what kind of problems does it solve? Suffix arrays are crucial for solving various string-related problems efficiently. They allow for fast substring searches, finding longest common substrings, identifying repeated patterns, and solving many bioinformatics and text processing tasks. They are a powerful alternative to suffix trees, often using less memory in practice.
Core concepts, underlying principles, and key terminology.
- Suffix: A substring of a string that starts at a particular position and extends to the end of the string. For example, the suffixes of “banana” are “banana”, “anana”, “nana”, “ana”, “na”, and “a”.
- Lexicographical Order: Suffixes are sorted in lexicographical (dictionary) order. “a” < “an” < “ana” < “anana” < “banana” < “na” < “nana”.
- Suffix Array (SA): An array of integers representing the starting positions of the sorted suffixes. For “banana”, SA would be [5, 3, 1, 0, 4, 2]. This means suffix starting at index 5 (“a”) is the smallest, then suffix starting at index 3 (“ana”), then suffix starting at index 1 (“anana”), and so on.
- Longest Common Prefix (LCP) Array: An array LCP such that LCP[i] is the length of the longest common prefix between the suffixes S[SA[i-1]:] and S[SA[i]:]. For “banana”, LCP would be [0, 1, 3, 0, 0, 2]. This requires an additional step to compute after the suffix array.
- Inverse Suffix Array (ISA) or Rank Array: An array ISA such that ISA[i] is the rank (position) of the suffix S[i:] in the sorted suffix array. ISA[SA[i]] = i.

2. When to Use suffix-array (and When Not To)

Identify problem patterns that suggest suffix-array is a good fit.
- Searching for substrings within a large text.
- Finding the longest common substring of two or more strings.
- Identifying repeated patterns or motifs in a sequence.
- Problems involving suffix-related queries, like finding the k-th smallest suffix.
- Bioinformatics tasks like sequence alignment and genome analysis.
Discuss scenarios where a different data structure or algorithm would be more appropriate.
- Simple substring search: If you only need to search for a few substrings and don’t need to perform complex suffix-related operations, simpler algorithms like strstr or KMP might be sufficient and faster to implement.
- Small input size: For small strings, the overhead of constructing the suffix array might outweigh its benefits. Brute-force approaches may be faster.
- Exact matches only: If you only need to find exact matches of short patterns, simpler string matching algorithms might be more efficient.
- Dynamic text: If the text is frequently modified, maintaining a suffix array can be expensive. Suffix trees are generally better suited to dynamic changes, though they have higher memory overhead.
- Regular Expression Matching: Regular expression matching is better handled by dedicated regular expression engines.

3. Core Algorithm / Data Structure Template

Here’s a general template for approaching problems using suffix arrays:

Construct the Suffix Array:
- Choose an efficient algorithm for suffix array construction. Common choices include:
  - O(n log^2 n) or O(n log n) algorithms: These are typically based on comparison sorting or radix sorting with clever optimizations. The code examples below use a O(n log^2 n) algorithm for clarity.
  - O(n) algorithms: Advanced algorithms like DC3/Skew algorithm exist, but are more complex to implement.
(Optional) Construct the LCP Array:
- If the problem requires LCP information (e.g., finding the longest repeated substring), compute the LCP array using Kasai’s algorithm or a similar method.
Solve the Problem:
- Use the suffix array (and LCP array if needed) to efficiently answer the problem’s specific queries. This often involves binary search or linear scans of the suffix array.

Pseudo-code for Suffix Array Construction (O(n log^2 n)):

function construct_suffix_array(text):
  n = length(text)
  suffixes = array of (suffix, index) pairs, where suffix is text[i:] and index is i
  sort suffixes lexicographically
  sa = array of indices from sorted suffixes
  return sa

Pseudo-code for LCP Array Construction (Kasai’s Algorithm - O(n)):

function construct_lcp_array(text, sa, isa):
  n = length(text)
  lcp = array of length n, initialized to 0
  h = 0
  for i from 0 to n - 1:
    if isa[i] > 0:
      j = sa[isa[i] - 1]
      while i + h < n and j + h < n and text[i + h] == text[j + h]:
        h = h + 1
      lcp[isa[i]] = h
      if h > 0:
        h = h - 1
  return lcp

4. Code Implementations (Python, Java, C++)

Python

def suffix_array(s):
    """
    Constructs the suffix array of a string s in O(n log^2 n) time.
    """
    n = len(s)
    suffixes = [(s[i:], i) for i in range(n)]
    suffixes.sort()  # Sorts lexicographically
    sa = [suffix[1] for suffix in suffixes]
    return sa

def inverse_suffix_array(sa):
    """
    Constructs the inverse suffix array from the suffix array.
    """
    n = len(sa)
    isa = [0] * n
    for i in range(n):
        isa[sa[i]] = i
    return isa

def lcp_array(s, sa, isa):
    """
    Constructs the LCP array using Kasai's algorithm in O(n) time.
    """
    n = len(s)
    lcp = [0] * n
    h = 0
    for i in range(n):
        if isa[i] > 0:
            j = sa[isa[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[isa[i]] = h
            if h > 0:
                h -= 1
    return lcp

# Example usage:
if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    isa = inverse_suffix_array(sa)
    lcp = lcp_array(text, sa, isa)

    print(f"Text: {text}")
    print(f"Suffix Array: {sa}")
    print(f"Inverse Suffix Array: {isa}")
    print(f"LCP Array: {lcp}")

Java

import java.util.Arrays;
import java.util.Comparator;

class SuffixArray {

    public static int[] suffixArray(String s) {
        int n = s.length();
        Integer[] indices = new Integer[n];
        for (int i = 0; i < n; i++) {
            indices[i] = i;
        }

        Arrays.sort(indices, Comparator.comparing(i -> s.substring(i)));

        int[] sa = new int[n];
        for (int i = 0; i < n; i++) {
            sa[i] = indices[i];
        }
        return sa;
    }

    public static int[] inverseSuffixArray(int[] sa) {
        int n = sa.length;
        int[] isa = new int[n];
        for (int i = 0; i < n; i++) {
            isa[sa[i]] = i;
        }
        return isa;
    }

    public static int[] lcpArray(String s, int[] sa, int[] isa) {
        int n = s.length();
        int[] lcp = new int[n];
        int h = 0;
        for (int i = 0; i < n; i++) {
            if (isa[i] > 0) {
                int j = sa[isa[i] - 1];
                while (i + h < n && j + h < n && s.charAt(i + h) == s.charAt(j + h)) {
                    h++;
                }
                lcp[isa[i]] = h;
                if (h > 0) {
                    h--;
                }
            }
        }
        return lcp;
    }

    public static void main(String[] args) {
        String text = "banana";
        int[] sa = suffixArray(text);
        int[] isa = inverseSuffixArray(sa);
        int[] lcp = lcpArray(text, sa, isa);

        System.out.println("Text: " + text);
        System.out.println("Suffix Array: " + Arrays.toString(sa));
        System.out.println("Inverse Suffix Array: " + Arrays.toString(isa));
        System.out.println("LCP Array: " + Arrays.toString(lcp));
    }
}

C++

#include <iostream>
#include <vector>
#include <string>
#include <algorithm>

using namespace std;

vector<int> suffixArray(const string& s) {
    int n = s.length();
    vector<int> indices(n);
    for (int i = 0; i < n; ++i) {
        indices[i] = i;
    }

    sort(indices.begin(), indices.end(), [&](int a, int b) {
        return s.substr(a) < s.substr(b);
    });

    return indices;
}

vector<int> inverseSuffixArray(const vector<int>& sa) {
    int n = sa.size();
    vector<int> isa(n);
    for (int i = 0; i < n; ++i) {
        isa[sa[i]] = i;
    }
    return isa;
}

vector<int> lcpArray(const string& s, const vector<int>& sa, const vector<int>& isa) {
    int n = s.length();
    vector<int> lcp(n, 0);
    int h = 0;
    for (int i = 0; i < n; ++i) {
        if (isa[i] > 0) {
            int j = sa[isa[i] - 1];
            while (i + h < n && j + h < n && s[i + h] == s[j + h]) {
                h++;
            }
            lcp[isa[i]] = h;
            if (h > 0) {
                h--;
            }
        }
    }
    return lcp;
}

int main() {
    string text = "banana";
    vector<int> sa = suffixArray(text);
    vector<int> isa = inverseSuffixArray(sa);
    vector<int> lcp = lcpArray(text, sa, isa);

    cout << "Text: " << text << endl;
    cout << "Suffix Array: ";
    for (int i : sa) cout << i << " ";
    cout << endl;

    cout << "Inverse Suffix Array: ";
    for (int i : isa) cout << i << " ";
    cout << endl;

    cout << "LCP Array: ";
    for (int i : lcp) cout << i << " ";
    cout << endl;

    return 0;
}

5. Complexity Analysis

Operation	Time Complexity	Space Complexity	Best Case	Average Case	Worst Case
Suffix Array Construction (O(n log^2 n))	O(n log^2 n)	O(n)	O(n log n)	O(n log^2 n)	O(n log^2 n)
LCP Array Construction (Kasai)	O(n)	O(n)	O(n)	O(n)	O(n)
Substring Search (using SA + Binary Search)	O(m log n)	O(1)	O(m)	O(m log n)	O(m log n)
Substring Search (using SA, LCP, and Binary Search)	O(m + log n)	O(1)	O(m)	O(m + log n)	O(m + log n)
Finding Longest Repeated Substring (using SA + LCP)	O(n)	O(1)	O(n)	O(n)	O(n)

n: Length of the string.
m: Length of the substring being searched for.

Explanation:

Suffix Array Construction: The O(n log^2 n) algorithm is relatively straightforward to implement. Faster O(n log n) and O(n) algorithms exist but are more complex.
LCP Array Construction: Kasai’s algorithm is the standard O(n) algorithm for computing the LCP array given the suffix array and inverse suffix array.
Substring Search: Using the suffix array, you can perform substring searches using binary search. Each comparison takes O(m) time. The LCP array can be used to optimize the binary search, reducing the overall complexity.
Longest Repeated Substring: The longest repeated substring can be found by finding the maximum value in the LCP array.

6. Pro Tips, Tricks, and Common Pitfalls

Choosing the Right Algorithm: For smaller strings (n < 1000), the O(n log^2 n) algorithm is often sufficient. For larger strings, consider more efficient algorithms like O(n log n) or O(n).
Memory Management: Suffix arrays can consume significant memory, especially for large strings. Be mindful of memory usage and consider using techniques like compact suffix arrays if memory is a concern.
Off-by-One Errors: Pay close attention to indexing when accessing elements in the suffix array and LCP array. Off-by-one errors are common.
Handling Special Characters: If the string contains special characters, ensure that the sorting algorithm handles them correctly. It’s often a good practice to use a consistent character encoding (e.g., UTF-8).
LCP Array Optimization: Precompute the LCP array and use it to speed up various queries, such as finding the longest common substring.
Binary Search Implementation: When using binary search with the suffix array, ensure that the comparison function is correctly implemented to handle lexicographical order.
Sentinel Characters: Consider appending a unique sentinel character (e.g., ’$’) to the end of the string during suffix array construction. This helps to avoid issues when comparing suffixes that are prefixes of each other.

7. Classic Problem Examples

Example: Longest Duplicate Substring

Description: Given a string s, consider all duplicated substrings: (contiguous) substrings of s that occur 2 or more times. The occurrences may overlap. Return any duplicated substring that has the longest possible length. If s does not have a duplicated substring, the answer is "". Example 1: Example 2: Constraints:

High-Level Approach:

Construct the Suffix Array: Build the suffix array for the input string s.
Construct the LCP Array: Build the LCP array corresponding to the suffix array.
Find the Maximum LCP Value: Iterate through the LCP array and find the maximum value. This value represents the length of the longest common prefix between two suffixes, which corresponds to the length of the longest duplicate substring.
Extract the Substring: If the maximum LCP value is greater than 0, extract the corresponding substring from the original string using the indices stored in the suffix array. The substring starts at the suffix sa[i] where lcp[i] is the maximum LCP value, and has a length of lcp[i].
Handle No Duplicate Substring: If the maximum LCP value is 0, return an empty string, indicating that there are no duplicate substrings.

def longestDupSubstring(s: str) -> str:
    """
    Finds the longest duplicate substring of a string s using suffix array and LCP array.
    """
    n = len(s)
    sa = suffix_array(s)
    isa = inverse_suffix_array(sa)
    lcp = lcp_array(s, sa, isa)

    max_lcp = 0
    max_index = -1

    for i in range(1, n):
        if lcp[i] > max_lcp:
            max_lcp = lcp[i]
            max_index = i

    if max_lcp == 0:
        return ""
    else:
        return s[sa[max_index]:sa[max_index] + max_lcp]