Suffix Array
suffix-array: The Ultimate Cheatsheet
Section titled “suffix-array: The Ultimate Cheatsheet”1. Detailed Explanation
Section titled “1. Detailed Explanation”-
What is a Suffix Array? A suffix array is a sorted array of all suffixes of a given string. More formally, for a string
Sof lengthn, the suffix arraySAis an array of integers of lengthnsuch thatSA[i]is the starting position of the i-th smallest suffix ofS. -
Why is it important and what kind of problems does it solve? Suffix arrays are crucial for solving various string-related problems efficiently. They allow for fast substring searches, finding longest common substrings, identifying repeated patterns, and solving many bioinformatics and text processing tasks. They are a powerful alternative to suffix trees, often using less memory in practice.
-
Core concepts, underlying principles, and key terminology.
- Suffix: A substring of a string that starts at a particular position and extends to the end of the string. For example, the suffixes of “banana” are “banana”, “anana”, “nana”, “ana”, “na”, and “a”.
- Lexicographical Order: Suffixes are sorted in lexicographical (dictionary) order. “a” < “an” < “ana” < “anana” < “banana” < “na” < “nana”.
- Suffix Array (SA): An array of integers representing the starting positions of the sorted suffixes. For “banana”, SA would be
[5, 3, 1, 0, 4, 2]. This means suffix starting at index 5 (“a”) is the smallest, then suffix starting at index 3 (“ana”), then suffix starting at index 1 (“anana”), and so on. - Longest Common Prefix (LCP) Array: An array
LCPsuch thatLCP[i]is the length of the longest common prefix between the suffixesS[SA[i-1]:]andS[SA[i]:]. For “banana”, LCP would be[0, 1, 3, 0, 0, 2]. This requires an additional step to compute after the suffix array. - Inverse Suffix Array (ISA) or Rank Array: An array
ISAsuch thatISA[i]is the rank (position) of the suffixS[i:]in the sorted suffix array.ISA[SA[i]] = i.
2. When to Use suffix-array (and When Not To)
Section titled “2. When to Use suffix-array (and When Not To)”-
Identify problem patterns that suggest suffix-array is a good fit.
- Searching for substrings within a large text.
- Finding the longest common substring of two or more strings.
- Identifying repeated patterns or motifs in a sequence.
- Problems involving suffix-related queries, like finding the k-th smallest suffix.
- Bioinformatics tasks like sequence alignment and genome analysis.
-
Discuss scenarios where a different data structure or algorithm would be more appropriate.
- Simple substring search: If you only need to search for a few substrings and don’t need to perform complex suffix-related operations, simpler algorithms like
strstrorKMPmight be sufficient and faster to implement. - Small input size: For small strings, the overhead of constructing the suffix array might outweigh its benefits. Brute-force approaches may be faster.
- Exact matches only: If you only need to find exact matches of short patterns, simpler string matching algorithms might be more efficient.
- Dynamic text: If the text is frequently modified, maintaining a suffix array can be expensive. Suffix trees are generally better suited to dynamic changes, though they have higher memory overhead.
- Regular Expression Matching: Regular expression matching is better handled by dedicated regular expression engines.
- Simple substring search: If you only need to search for a few substrings and don’t need to perform complex suffix-related operations, simpler algorithms like
3. Core Algorithm / Data Structure Template
Section titled “3. Core Algorithm / Data Structure Template”Here’s a general template for approaching problems using suffix arrays:
- Construct the Suffix Array:
- Choose an efficient algorithm for suffix array construction. Common choices include:
- O(n log^2 n) or O(n log n) algorithms: These are typically based on comparison sorting or radix sorting with clever optimizations. The code examples below use a O(n log^2 n) algorithm for clarity.
- O(n) algorithms: Advanced algorithms like DC3/Skew algorithm exist, but are more complex to implement.
- Choose an efficient algorithm for suffix array construction. Common choices include:
- (Optional) Construct the LCP Array:
- If the problem requires LCP information (e.g., finding the longest repeated substring), compute the LCP array using Kasai’s algorithm or a similar method.
- Solve the Problem:
- Use the suffix array (and LCP array if needed) to efficiently answer the problem’s specific queries. This often involves binary search or linear scans of the suffix array.
Pseudo-code for Suffix Array Construction (O(n log^2 n)):
function construct_suffix_array(text): n = length(text) suffixes = array of (suffix, index) pairs, where suffix is text[i:] and index is i sort suffixes lexicographically sa = array of indices from sorted suffixes return saPseudo-code for LCP Array Construction (Kasai’s Algorithm - O(n)):
function construct_lcp_array(text, sa, isa): n = length(text) lcp = array of length n, initialized to 0 h = 0 for i from 0 to n - 1: if isa[i] > 0: j = sa[isa[i] - 1] while i + h < n and j + h < n and text[i + h] == text[j + h]: h = h + 1 lcp[isa[i]] = h if h > 0: h = h - 1 return lcp4. Code Implementations (Python, Java, C++)
Section titled “4. Code Implementations (Python, Java, C++)”Python
Section titled “Python”def suffix_array(s): """ Constructs the suffix array of a string s in O(n log^2 n) time. """ n = len(s) suffixes = [(s[i:], i) for i in range(n)] suffixes.sort() # Sorts lexicographically sa = [suffix[1] for suffix in suffixes] return sa
def inverse_suffix_array(sa): """ Constructs the inverse suffix array from the suffix array. """ n = len(sa) isa = [0] * n for i in range(n): isa[sa[i]] = i return isa
def lcp_array(s, sa, isa): """ Constructs the LCP array using Kasai's algorithm in O(n) time. """ n = len(s) lcp = [0] * n h = 0 for i in range(n): if isa[i] > 0: j = sa[isa[i] - 1] while i + h < n and j + h < n and s[i + h] == s[j + h]: h += 1 lcp[isa[i]] = h if h > 0: h -= 1 return lcp
# Example usage:if __name__ == "__main__": text = "banana" sa = suffix_array(text) isa = inverse_suffix_array(sa) lcp = lcp_array(text, sa, isa)
print(f"Text: {text}") print(f"Suffix Array: {sa}") print(f"Inverse Suffix Array: {isa}") print(f"LCP Array: {lcp}")import java.util.Arrays;import java.util.Comparator;
class SuffixArray {
public static int[] suffixArray(String s) { int n = s.length(); Integer[] indices = new Integer[n]; for (int i = 0; i < n; i++) { indices[i] = i; }
Arrays.sort(indices, Comparator.comparing(i -> s.substring(i)));
int[] sa = new int[n]; for (int i = 0; i < n; i++) { sa[i] = indices[i]; } return sa; }
public static int[] inverseSuffixArray(int[] sa) { int n = sa.length; int[] isa = new int[n]; for (int i = 0; i < n; i++) { isa[sa[i]] = i; } return isa; }
public static int[] lcpArray(String s, int[] sa, int[] isa) { int n = s.length(); int[] lcp = new int[n]; int h = 0; for (int i = 0; i < n; i++) { if (isa[i] > 0) { int j = sa[isa[i] - 1]; while (i + h < n && j + h < n && s.charAt(i + h) == s.charAt(j + h)) { h++; } lcp[isa[i]] = h; if (h > 0) { h--; } } } return lcp; }
public static void main(String[] args) { String text = "banana"; int[] sa = suffixArray(text); int[] isa = inverseSuffixArray(sa); int[] lcp = lcpArray(text, sa, isa);
System.out.println("Text: " + text); System.out.println("Suffix Array: " + Arrays.toString(sa)); System.out.println("Inverse Suffix Array: " + Arrays.toString(isa)); System.out.println("LCP Array: " + Arrays.toString(lcp)); }}#include <iostream>#include <vector>#include <string>#include <algorithm>
using namespace std;
vector<int> suffixArray(const string& s) { int n = s.length(); vector<int> indices(n); for (int i = 0; i < n; ++i) { indices[i] = i; }
sort(indices.begin(), indices.end(), [&](int a, int b) { return s.substr(a) < s.substr(b); });
return indices;}
vector<int> inverseSuffixArray(const vector<int>& sa) { int n = sa.size(); vector<int> isa(n); for (int i = 0; i < n; ++i) { isa[sa[i]] = i; } return isa;}
vector<int> lcpArray(const string& s, const vector<int>& sa, const vector<int>& isa) { int n = s.length(); vector<int> lcp(n, 0); int h = 0; for (int i = 0; i < n; ++i) { if (isa[i] > 0) { int j = sa[isa[i] - 1]; while (i + h < n && j + h < n && s[i + h] == s[j + h]) { h++; } lcp[isa[i]] = h; if (h > 0) { h--; } } } return lcp;}
int main() { string text = "banana"; vector<int> sa = suffixArray(text); vector<int> isa = inverseSuffixArray(sa); vector<int> lcp = lcpArray(text, sa, isa);
cout << "Text: " << text << endl; cout << "Suffix Array: "; for (int i : sa) cout << i << " "; cout << endl;
cout << "Inverse Suffix Array: "; for (int i : isa) cout << i << " "; cout << endl;
cout << "LCP Array: "; for (int i : lcp) cout << i << " "; cout << endl;
return 0;}5. Complexity Analysis
Section titled “5. Complexity Analysis”| Operation | Time Complexity | Space Complexity | Best Case | Average Case | Worst Case |
|---|---|---|---|---|---|
| Suffix Array Construction (O(n log^2 n)) | O(n log^2 n) | O(n) | O(n log n) | O(n log^2 n) | O(n log^2 n) |
| LCP Array Construction (Kasai) | O(n) | O(n) | O(n) | O(n) | O(n) |
| Substring Search (using SA + Binary Search) | O(m log n) | O(1) | O(m) | O(m log n) | O(m log n) |
| Substring Search (using SA, LCP, and Binary Search) | O(m + log n) | O(1) | O(m) | O(m + log n) | O(m + log n) |
| Finding Longest Repeated Substring (using SA + LCP) | O(n) | O(1) | O(n) | O(n) | O(n) |
- n: Length of the string.
- m: Length of the substring being searched for.
Explanation:
- Suffix Array Construction: The O(n log^2 n) algorithm is relatively straightforward to implement. Faster O(n log n) and O(n) algorithms exist but are more complex.
- LCP Array Construction: Kasai’s algorithm is the standard O(n) algorithm for computing the LCP array given the suffix array and inverse suffix array.
- Substring Search: Using the suffix array, you can perform substring searches using binary search. Each comparison takes O(m) time. The LCP array can be used to optimize the binary search, reducing the overall complexity.
- Longest Repeated Substring: The longest repeated substring can be found by finding the maximum value in the LCP array.
6. Pro Tips, Tricks, and Common Pitfalls
Section titled “6. Pro Tips, Tricks, and Common Pitfalls”- Choosing the Right Algorithm: For smaller strings (n < 1000), the O(n log^2 n) algorithm is often sufficient. For larger strings, consider more efficient algorithms like O(n log n) or O(n).
- Memory Management: Suffix arrays can consume significant memory, especially for large strings. Be mindful of memory usage and consider using techniques like compact suffix arrays if memory is a concern.
- Off-by-One Errors: Pay close attention to indexing when accessing elements in the suffix array and LCP array. Off-by-one errors are common.
- Handling Special Characters: If the string contains special characters, ensure that the sorting algorithm handles them correctly. It’s often a good practice to use a consistent character encoding (e.g., UTF-8).
- LCP Array Optimization: Precompute the LCP array and use it to speed up various queries, such as finding the longest common substring.
- Binary Search Implementation: When using binary search with the suffix array, ensure that the comparison function is correctly implemented to handle lexicographical order.
- Sentinel Characters: Consider appending a unique sentinel character (e.g., ’$’) to the end of the string during suffix array construction. This helps to avoid issues when comparing suffixes that are prefixes of each other.
7. Classic Problem Examples
Section titled “7. Classic Problem Examples”Description: Given a string s, consider all duplicated substrings: (contiguous) substrings of s that occur 2 or more times. The occurrences may overlap. Return any duplicated substring that has the longest possible length. If s does not have a duplicated substring, the answer is "". Example 1: Example 2: Constraints:
High-Level Approach:
- Construct the Suffix Array: Build the suffix array for the input string
s. - Construct the LCP Array: Build the LCP array corresponding to the suffix array.
- Find the Maximum LCP Value: Iterate through the LCP array and find the maximum value. This value represents the length of the longest common prefix between two suffixes, which corresponds to the length of the longest duplicate substring.
- Extract the Substring: If the maximum LCP value is greater than 0, extract the corresponding substring from the original string using the indices stored in the suffix array. The substring starts at the suffix
sa[i]wherelcp[i]is the maximum LCP value, and has a length oflcp[i]. - Handle No Duplicate Substring: If the maximum LCP value is 0, return an empty string, indicating that there are no duplicate substrings.
def longestDupSubstring(s: str) -> str: """ Finds the longest duplicate substring of a string s using suffix array and LCP array. """ n = len(s) sa = suffix_array(s) isa = inverse_suffix_array(sa) lcp = lcp_array(s, sa, isa)
max_lcp = 0 max_index = -1
for i in range(1, n): if lcp[i] > max_lcp: max_lcp = lcp[i] max_index = i
if max_lcp == 0: return "" else: return s[sa[max_index]:sa[max_index] + max_lcp]