Hash Function

hash-function: The Ultimate Cheatsheet

1. Detailed Explanation

What is a hash function? A hash function is a function that takes an input (or ‘key’) and produces a fixed-size output (a ‘hash value’ or ‘hash code’). The hash value serves as a representative of the original input. A good hash function should be deterministic (always producing the same hash for the same input) and should distribute inputs evenly across the possible output values to minimize collisions.
Why is it important and what kind of problems does it solve? Hash functions are crucial for:
- Hash Tables: Enabling efficient data retrieval, insertion, and deletion (average case O(1) time complexity).
- Data Integrity: Verifying the integrity of data by comparing hash values before and after transmission or storage. Any change in the data will result in a different hash value.
- Cryptography: Used in cryptographic algorithms for password storage, digital signatures, and message authentication.
- Caching: Quickly locating cached data based on a key.
- Load Balancing: Distributing network traffic across multiple servers.
- Data Deduplication: Identifying and removing duplicate data.
Core concepts, underlying principles, and key terminology:
- Key: The input to the hash function.
- Hash Value (Hash Code): The output of the hash function.
- Hash Table: A data structure that uses a hash function to map keys to their corresponding values.
- Collision: When two different keys produce the same hash value.
- Collision Resolution: Techniques to handle collisions (e.g., chaining, open addressing).
- Uniform Hashing: The ideal property of a hash function where each key is equally likely to be hashed to any slot in the hash table.
- Deterministic: A hash function must always produce the same hash value for the same input.
- Hash Space: The range of possible hash values.
- Load Factor: The ratio of the number of keys in a hash table to the number of slots (buckets) in the table. A high load factor increases the likelihood of collisions.

2. When to Use hash-function (and When Not To)

Identify problem patterns that suggest a hash function is a good fit:
- Frequent lookups: When you need to quickly check if an element exists in a collection.
- Counting element frequencies: When you need to count the occurrences of different elements.
- Detecting duplicates: When you need to identify duplicate elements in a large dataset.
- Caching: When you need to store and retrieve data quickly based on a key.
- Unordered data: When the order of elements doesn’t matter.
Discuss scenarios where a different data structure or algorithm would be more appropriate:
- Ordered data: If you need to maintain the order of elements, a sorted array, a balanced binary search tree (e.g., AVL tree, Red-Black tree), or a linked list might be more appropriate.
- Range queries: If you need to perform range queries (e.g., find all elements within a certain range), a hash table is not the best choice. Consider using a balanced binary search tree or a segment tree.
- Small datasets: For very small datasets, the overhead of calculating hash values and managing collisions might outweigh the benefits of using a hash table. A simple linear search might be faster.
- Guaranteed worst-case performance: Hash tables have O(1) average-case time complexity for insertion, deletion, and lookup, but the worst-case time complexity can be O(n) if all keys hash to the same slot. If you need guaranteed O(log n) performance for these operations, a balanced binary search tree is a better choice.

3. Core Algorithm / Data Structure Template

// High-level template for using a hash function and hash table:

1. **Choose a suitable hash function:**  Consider the data type of the keys and the size of the hash table.  Aim for uniform distribution to minimize collisions.

2. **Determine the size of the hash table:**  Choose a size that is large enough to accommodate the expected number of keys while keeping the load factor low (typically less than 0.7). Prime numbers are often used as hash table sizes to improve distribution.

3. **Implement collision resolution:**  Select a collision resolution strategy (e.g., chaining, open addressing).

4. **Insertion:**
   - Calculate the hash value of the key.
   - Determine the index in the hash table using the hash value (e.g., index = hash_value % table_size).
   - Insert the key-value pair into the hash table at the calculated index, handling collisions as needed.

5. **Lookup:**
   - Calculate the hash value of the key.
   - Determine the index in the hash table using the hash value.
   - Search for the key at the calculated index, handling collisions as needed.
   - Return the corresponding value if the key is found, or a default value (e.g., null) if the key is not found.

6. **Deletion:**
   - Calculate the hash value of the key.
   - Determine the index in the hash table using the hash value.
   - Search for the key at the calculated index, handling collisions as needed.
   - Remove the key-value pair from the hash table.

4. Code Implementations (Python, Java, C++)

Python

class HashTable:
    def __init__(self, size=11):  # Prime number for better distribution
        self.size = size
        self.table = [None] * self.size
        self.keys = [None] * self.size # Keep track of keys to allow key iteration

    def __len__(self):
        return sum(1 for x in self.table if x is not None)

    def __contains__(self, key):
        index = self._hash(key) % self.size
        if self.table[index] is not None:
            for k, _ in self.table[index]:
                if k == key:
                    return True
        return False

    def _hash(self, key):
        # A simple hash function.  Consider using more sophisticated hash functions for real-world applications.
        return hash(key)

    def insert(self, key, value):
        index = self._hash(key) % self.size
        if self.table[index] is None:
            self.table[index] = [(key, value)] # Create a list of key-value pairs (chaining)
        else:
            # Check if key already exists, if so update value
            for i, (k, v) in enumerate(self.table[index]):
                if k == key:
                    self.table[index][i] = (key, value)
                    return
            self.table[index].append((key, value)) # Add to the chain

        if key not in self.keys:
            for i in range(len(self.keys)):
                if self.keys[i] is None:
                    self.keys[i] = key
                    break

    def get(self, key):
        index = self._hash(key) % self.size
        if self.table[index] is not None:
            for k, v in self.table[index]:
                if k == key:
                    return v
        return None  # Key not found

    def delete(self, key):
        index = self._hash(key) % self.size
        if self.table[index] is not None:
            original_length = len(self.table[index])
            self.table[index] = [(k, v) for k, v in self.table[index] if k != key]
            if len(self.table[index]) < original_length:
                #Remove the key from self.keys
                try:
                    i = self.keys.index(key)
                    self.keys[i] = None
                except ValueError:
                    pass #Key was present in the list but not the keys array, this should not happen.

    def __iter__(self):
        #Iterate over all keys in the hashtable
        for key in self.keys:
            if key is not None:
                yield key

# Example Usage:
ht = HashTable()
ht.insert("apple", 1)
ht.insert("banana", 2)
ht.insert("cherry", 3)
print(ht.get("banana"))  # Output: 2
ht.delete("banana")
print(ht.get("banana"))  # Output: None
print(len(ht)) # Output: 2
print("apple" in ht) # Output: True
for key in ht:
    print(key)

Java

import java.util.LinkedList;

class HashTable<K, V> {
    private static final int DEFAULT_SIZE = 11; // Prime number
    private LinkedList<Entry<K, V>>[] table;
    private K[] keys;
    private int keyCount;

    public HashTable() {
        this(DEFAULT_SIZE);
    }

    public HashTable(int size) {
        table = new LinkedList[size];
        keys = (K[]) new Object[size]; //Type safety is not guaranteed at compile time.
        keyCount = 0;
    }

    private int hash(K key) {
        return Math.abs(key.hashCode()) % table.length;
    }

    public void insert(K key, V value) {
        int index = hash(key);
        if (table[index] == null) {
            table[index] = new LinkedList<>();
        }

        for (Entry<K, V> entry : table[index]) {
            if (entry.key.equals(key)) {
                entry.value = value; // Update existing value
                return;
            }
        }

        table[index].add(new Entry<>(key, value));

        //Store the key, this allows iteration
        boolean found = false;
        for(int i = 0; i < keys.length; i++){
            if(keys[i] == null){
                keys[i] = key;
                found = true;
                break;
            }
        }
        if(!found){
            //If the keys array is full, we need to resize it.
            K[] newKeys = (K[]) new Object[keys.length * 2];
            System.arraycopy(keys, 0, newKeys, 0, keys.length);
            keys = newKeys;
            for(int i = keys.length / 2; i < keys.length; i++){
                if(keys[i] == null){
                    keys[i] = key;
                    break;
                }
            }

        }


    }

    public V get(K key) {
        int index = hash(key);
        if (table[index] != null) {
            for (Entry<K, V> entry : table[index]) {
                if (entry.key.equals(key)) {
                    return entry.value;
                }
            }
        }
        return null;
    }

    public void delete(K key) {
        int index = hash(key);
        if (table[index] != null) {
            table[index].removeIf(entry -> entry.key.equals(key));

            //Remove the key from the keys array
            for(int i = 0; i < keys.length; i++){
                if(keys[i] != null && keys[i].equals(key)){
                    keys[i] = null;
                    break;
                }
            }
        }
    }

    public boolean contains(K key) {
        int index = hash(key);
        if (table[index] != null) {
            for (Entry<K, V> entry : table[index]) {
                if (entry.key.equals(key)) {
                    return true;
                }
            }
        }
        return false;
    }

    public int size() {
        int count = 0;
        for (LinkedList<Entry<K, V>> list : table) {
            if (list != null) {
                count += list.size();
            }
        }
        return count;
    }

    public Iterable<K> getKeys(){
        return new Iterable<K>() {
            @Override
            public java.util.Iterator<K> iterator() {
                return new java.util.Iterator<K>() {
                    int currentIndex = 0;

                    @Override
                    public boolean hasNext() {
                        while (currentIndex < keys.length && keys[currentIndex] == null) {
                            currentIndex++;
                        }
                        return currentIndex < keys.length;
                    }

                    @Override
                    public K next() {
                        if (!hasNext()) {
                            throw new java.util.NoSuchElementException();
                        }
                        return keys[currentIndex++];
                    }
                };
            }
        };
    }


    private static class Entry<K, V> {
        K key;
        V value;

        public Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    public static void main(String[] args) {
        HashTable<String, Integer> ht = new HashTable<>();
        ht.insert("apple", 1);
        ht.insert("banana", 2);
        ht.insert("cherry", 3);
        System.out.println(ht.get("banana")); // Output: 2
        ht.delete("banana");
        System.out.println(ht.get("banana")); // Output: null
        System.out.println(ht.size()); // Output: 2
        System.out.println(ht.contains("apple")); // Output: true
        for(String key : ht.getKeys()){
            System.out.println(key);
        }
    }
}

C++

#include <iostream>
#include <list>
#include <string>
#include <vector>

template <typename K, typename V>
class HashTable {
private:
    static const int DEFAULT_SIZE = 11; // Prime number
    std::list<std::pair<K, V>>* table;
    std::vector<K> keys;
    int tableSize;

    int hash(const K& key) const {
        // Simple hash function using std::hash (may need custom specialization for user-defined types).
        return std::abs(std::hash<K>{}(key) % tableSize);
    }

public:
    HashTable(int size = DEFAULT_SIZE) : tableSize(size), table(new std::list<std::pair<K, V>>[size]) {
        keys.reserve(size); // Reserve space for keys
    }

    ~HashTable() {
        delete[] table;
    }

    void insert(const K& key, const V& value) {
        int index = hash(key);
        for (auto& entry : table[index]) {
            if (entry.first == key) {
                entry.second = value; // Update existing value
                return;
            }
        }
        table[index].push_back(std::make_pair(key, value));

        // Store key, this allows iteration
        bool found = false;
        for (K& k : keys) {
            if (k == K()) { // Check for default-constructed K (empty slot)
                k = key;
                found = true;
                break;
            }
        }
        if (!found) {
            keys.push_back(key);
        }
    }

    V get(const K& key) const {
        int index = hash(key);
        for (const auto& entry : table[index]) {
            if (entry.first == key) {
                return entry.second;
            }
        }
        return V(); // Return default-constructed V if not found
    }

    void remove(const K& key) {
        int index = hash(key);
        table[index].remove_if([&](const std::pair<K, V>& entry) { return entry.first == key; });

        // Remove key from keys vector
        for (size_t i = 0; i < keys.size(); ++i) {
            if (keys[i] == key) {
                keys[i] = K(); // Set to default-constructed K
                break;
            }
        }
    }

    bool contains(const K& key) const {
        int index = hash(key);
        for (const auto& entry : table[index]) {
            if (entry.first == key) {
                return true;
            }
        }
        return false;
    }

    int size() const {
        int count = 0;
        for (int i = 0; i < tableSize; ++i) {
            count += std::distance(table[i].begin(), table[i].end());
        }
        return count;
    }

    // Iterator over keys
    class KeyIterator {
    private:
        typename std::vector<K>::iterator current;
        typename std::vector<K>::iterator end;

    public:
        KeyIterator(typename std::vector<K>::iterator begin, typename std::vector<K>::iterator end) : current(begin), end(end) {}

        bool hasNext() const {
            typename std::vector<K>::iterator temp = current;
            while (temp != end && *temp == K()) {
                ++temp;
            }
            return temp != end;
        }

        K next() {
            if (!hasNext()) {
                throw std::out_of_range("No more elements");
            }
            while (*current == K()) {
                ++current;
            }
            return *current++;
        }
    };

    KeyIterator keyIterator() {
        return KeyIterator(keys.begin(), keys.end());
    }

    void printKeys() {
        for (const K& key : keys) {
            if (key != K()) {
                std::cout << key << " ";
            }
        }
        std::cout << std::endl;
    }

};

int main() {
    HashTable<std::string, int> ht;
    ht.insert("apple", 1);
    ht.insert("banana", 2);
    ht.insert("cherry", 3);

    std::cout << ht.get("banana") << std::endl; // Output: 2
    ht.remove("banana");
    std::cout << ht.get("banana") << std::endl; // Output: 0 (default int)
    std::cout << ht.size() << std::endl; // Output: 2
    std::cout << ht.contains("apple") << std::endl; // Output: 1 (true)

    std::cout << "Keys: ";
    ht.printKeys();

    HashTable<int, std::string> ht2;
    ht2.insert(1, "value1");
    ht2.insert(2, "value2");

    HashTable<int, std::string>::KeyIterator it = ht2.keyIterator();

    std::cout << "Iterating keys: ";
    while(it.hasNext()){
        std::cout << it.next() << " ";
    }
    std::cout << std::endl;

    return 0;
}

5. Complexity Analysis

Operation	Best Case	Average Case	Worst Case	Space Complexity
Insertion	O(1)	O(1)	O(n)	O(n)
Lookup	O(1)	O(1)	O(n)	O(1)
Deletion	O(1)	O(1)	O(n)	O(1)

Best Case: The key is found immediately in the first slot (no collisions).
Average Case: Assuming a good hash function and a low load factor, the average time complexity for insertion, lookup, and deletion is O(1).
Worst Case: All keys hash to the same slot, resulting in a linked list of length n (where n is the number of keys). In this case, insertion, lookup, and deletion take O(n) time. Space complexity is O(n) because we are storing all keys.

6. Pro Tips, Tricks, and Common Pitfalls

Pro Tips and Tricks:
- Choose a good hash function: This is the most important factor for hash table performance. Use well-established hash functions like MurmurHash, FNV-1a, or CityHash. Avoid simple hash functions that are prone to collisions. For strings, consider using polynomial rolling hash.
- Use a prime number for the hash table size: This helps to distribute keys more evenly and reduce collisions.
- Keep the load factor low: A load factor of 0.7 or less is generally recommended. Resize the hash table if the load factor becomes too high.
- Implement rehashing: When the hash table becomes too full, resize it to a larger size and rehash all the keys. This maintains good performance as the number of keys increases.
- Use separate chaining with linked lists or dynamic arrays: Linked lists are simple to implement, but dynamic arrays can offer better performance if the number of collisions is small.
- Consider using open addressing with probing techniques (linear probing, quadratic probing, double hashing): Open addressing can be more memory-efficient than separate chaining, but it can also be more susceptible to clustering (keys tending to hash to consecutive slots). Double hashing is generally the best open addressing technique for avoiding clustering.
- For custom objects: Always override equals() and hashCode() methods together in Java (or equivalent in other languages) to ensure consistent behavior.
Common Pitfalls:
- Poor hash function: Choosing a hash function that doesn’t distribute keys evenly can lead to many collisions and poor performance.
- Ignoring collisions: Not handling collisions properly can lead to data loss or incorrect results.
- High load factor: Allowing the load factor to become too high can degrade performance significantly.
- Not resizing the hash table: If the hash table becomes too full, insertion, lookup, and deletion will become slow.
- Integer overflow: Be careful of integer overflow when calculating hash values, especially when using multiplication. Use modulo operator appropriately.
- Mutable keys: Using mutable objects as keys can lead to unexpected behavior if the key is modified after it has been inserted into the hash table.
- Incorrect equals() and hashCode() implementations: In languages like Java, if you override equals(), you must also override hashCode() to ensure that equal objects have equal hash codes. If you don’t, your hash table will not work correctly.

7. Classic Problem Examples

Example: Shortest Palindrome

Description: You are given a string s. You can convert s to a palindrome by adding characters in front of it. Return the shortest palindrome you can find by performing this transformation.

High-Level Approach:

String Manipulation: The core idea is to find the longest palindrome starting from the beginning of the input string s.
Hashing/KMP: One efficient way to solve this is to use a rolling hash or the Knuth-Morris-Pratt (KMP) algorithm. Here, we’ll outline the hashing approach.
Formulate a Combined String: Create a new string t = s + "#" + s.reverse(). The ”#” is a delimiter.
Rolling Hash: Iterate through t calculating rolling hashes for the prefixes of s and the suffixes of s.reverse(). Compare the hashes.
Find the Longest Palindromic Prefix: The longest match corresponds to the longest palindromic prefix of s.
Construct the Shortest Palindrome: Take the remaining part of s (the non-palindromic suffix), reverse it, and prepend it to the original string s.

def shortestPalindrome(s: str) -> str:
    """
    Finds the shortest palindrome by adding characters to the front of the input string.
    """
    n = len(s)
    for i in range(n - 1, -1, -1):
        if s[: i + 1] == s[: i + 1][::-1]:
            return s[i + 1:][::-1] + s
    return s[::-1] + s

This solution doesn’t explicitly use a hash table, but it uses the concept of comparing prefixes with reversed suffixes, a technique often facilitated by hash functions for efficiency. The time complexity is O(n^2) in the worst case.