Hash Table

hash-table: The Ultimate Cheatsheet

1. Detailed Explanation

A hash table (also known as a hash map) is a data structure that implements an associative array abstract data type, a structure that can map keys to values. It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found. This allows for very fast lookups, insertions, and deletions on average.

Why is it important? Hash tables are crucial because they provide O(1) average-case time complexity for key-based operations (insertion, deletion, lookup), making them significantly faster than other data structures like balanced binary search trees (O(log n) ) for these operations, especially for large datasets. This efficiency is essential in numerous applications.

Core Concepts:

Hash Function: A function that maps keys to integer indices (hash codes). A good hash function should distribute keys uniformly across the hash table to minimize collisions.
Hash Table: An array of buckets or slots, where each bucket can store one or more key-value pairs.
Collision: When two or more keys hash to the same index. Collision handling techniques are necessary to resolve this.
Collision Handling: Strategies to manage collisions, such as separate chaining (each bucket is a linked list) or open addressing (probing for the next available slot).
Load Factor: The ratio of the number of entries to the number of buckets. A high load factor increases the likelihood of collisions, impacting performance. Resizing the hash table is often necessary to maintain a low load factor.

2. When to Use hash-table (and When Not To)

Use hash tables when:

You need fast average-case lookups, insertions, and deletions.
You need to store data associated with unique keys.
You are dealing with a large dataset where O(1) average-case time complexity is crucial.
You need to check for the existence of an element quickly.

Don’t use hash tables when:

You need ordered data (use a sorted array or balanced tree).
You need to efficiently find the kth smallest/largest element (use a heap or balanced tree).
You have limited memory and collision handling overhead is unacceptable. In such scenarios, consider using a more space-efficient structure, even if it means slightly slower operations.
Key uniqueness is not guaranteed and you need to handle duplicates efficiently (consider using a different data structure depending on how duplicates should be treated).

3. Core Algorithm / Data Structure Template

This pseudocode outlines a basic hash table implementation using separate chaining for collision handling:

// Hash Table using Separate Chaining

// Initialize: Create an array of linked lists (buckets) of a specified size.

// Hash Function:  A function that takes a key as input and returns an integer index (0 to table size - 1).

// Insert(key, value):
  1. Calculate the hash code (index) for the key using the hash function.
  2. Insert the (key, value) pair into the linked list at the calculated index.

// Get(key):
  1. Calculate the hash code (index) for the key using the hash function.
  2. Traverse the linked list at that index.
  3. If the key is found, return the associated value. Otherwise, return null or throw an exception.

// Delete(key):
  1. Calculate the hash code (index) for the key using the hash function.
  2. Traverse the linked list at that index.
  3. If the key is found, remove the (key, value) pair from the linked list.

4. Code Implementations

Python

class HashTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.table = [[] for _ in range(capacity)]

    def __len__(self):
        return sum(len(chain) for chain in self.table)

    def __contains__(self, key):
        index = hash(key) % self.capacity
        for k, _ in self.table[index]:
            if k == key:
                return True
        return False

    def insert(self, key, value):
        index = hash(key) % self.capacity
        for i, (k, v) in enumerate(self.table[index]):
            if k == key:
                self.table[index][i] = (key, value)  # Update existing key
                return
        self.table[index].append((key, value))

    def get(self, key):
        index = hash(key) % self.capacity
        for k, v in self.table[index]:
            if k == key:
                return v
        return None

    def delete(self, key):
        index = hash(key) % self.capacity
        for i, (k, v) in enumerate(self.table[index]):
            if k == key:
                del self.table[index][i]
                return

Java

import java.util.LinkedList;
import java.util.List;

class HashTable {
    private List<List<Entry>> table;
    private int capacity;

    static class Entry {
        String key;
        int value;

        Entry(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    public HashTable(int capacity) {
        this.capacity = capacity;
        this.table = new LinkedList<>();
        for (int i = 0; i < capacity; i++) {
            table.add(new LinkedList<>());
        }
    }

    private int hash(String key) {
        return Math.abs(key.hashCode()) % capacity;
    }

    public void insert(String key, int value) {
        int index = hash(key);
        for (Entry entry : table.get(index)) {
            if (entry.key.equals(key)) {
                entry.value = value; // Update existing key
                return;
            }
        }
        table.get(index).add(new Entry(key, value));
    }

    public Integer get(String key) {
        int index = hash(key);
        for (Entry entry : table.get(index)) {
            if (entry.key.equals(key)) {
                return entry.value;
            }
        }
        return null;
    }

    public void delete(String key) {
        int index = hash(key);
        table.get(index).removeIf(entry -> entry.key.equals(key));
    }
}

C++

#include <iostream>
#include <vector>
#include <list>
#include <string>

using namespace std;

class HashTable {
private:
    vector<list<pair<string, int>>> table;
    int capacity;

    int hash(const string& key) {
        int hashValue = 0;
        for (char c : key) {
            hashValue = 31 * hashValue + c;
        }
        return abs(hashValue) % capacity;
    }

public:
    HashTable(int capacity) : capacity(capacity), table(capacity) {}

    void insert(const string& key, int value) {
        int index = hash(key);
        for (auto& entry : table[index]) {
            if (entry.first == key) {
                entry.second = value; // Update existing key
                return;
            }
        }
        table[index].push_back({key, value});
    }

    int get(const string& key) {
        int index = hash(key);
        for (const auto& entry : table[index]) {
            if (entry.first == key) {
                return entry.second;
            }
        }
        return -1; // Or throw an exception
    }

    void deleteKey(const string& key) {
        int index = hash(key);
        table[index].remove_if([&](const pair<string, int>& entry) { return entry.first == key; });
    }
};

5. Complexity Analysis

Operation	Best Case	Average Case	Worst Case	Space Complexity
Insertion	O(1)	O(1)	O(n)	O(n)
Deletion	O(1)	O(1)	O(n)	O(n)
Search/Lookup	O(1)	O(1)	O(n)	O(n)

Explanation:

Best Case: O(1) – The key hashes to a bucket with no collisions.
Average Case: O(1) – The key hashes to a bucket with a few collisions (assuming a good hash function and reasonable load factor).
Worst Case: O(n) – All keys hash to the same bucket (extremely unlikely with a good hash function, but possible). This is why a good hash function and load factor management are essential.
Space Complexity: O(n) – The space used is proportional to the number of elements stored.

6. Pro Tips, Tricks, and Common Pitfalls

Pro Tips:

Choose a good hash function: A well-distributed hash function minimizes collisions and improves performance. Consider using built-in hash functions or libraries optimized for your programming language.
Handle resizing: As the load factor increases, resize the hash table to maintain performance. Dynamic resizing strategies (e.g., doubling the size when the load factor exceeds a threshold) are common.
Consider different collision handling techniques: Separate chaining is simple, but open addressing (linear probing, quadratic probing, double hashing) can be more space-efficient, although more complex to implement correctly.
Prime numbers for capacity: Using a prime number for the hash table capacity can sometimes improve hash function distribution.

Common Pitfalls:

Poor hash function: A poorly designed hash function leads to many collisions, degrading performance to O(n).
Ignoring load factor: Failing to resize the hash table when the load factor gets too high leads to performance degradation.
Incorrect collision handling: Implementing collision resolution incorrectly can lead to bugs and unpredictable behavior.
Infinite loops in open addressing: If open addressing is used without proper handling of full tables, it can lead to infinite loops.

7. Classic Problem Examples

Example: Longest Substring Without Repeating Characters

Description: Given a string s, find the length of the longest substring without repeating characters.

High-level approach using a hash table:

Use a sliding window approach. Maintain a window within the string.
Use a hash table (dictionary in Python) to store the characters in the current window and their indices.
As you expand the window to the right, check if the new character is already in the hash table.
If it is, shrink the window from the left until the duplicate is removed.
Update the maximum length of the substring found so far.
Repeat until the entire string is processed. The hash table efficiently tracks the presence and index of each character within the current window.

Example 1: “abcabcbb” Longest substring is “abc”, length 3. Example 2: “bbbbb” Longest substring is “b”, length 1. Example 3: “pwwkew” Longest substring is “wke”, length 3.

(Note: Complete code implementations for this problem and other classic examples are beyond the scope of this cheatsheet, but the high-level approach using a hash table is provided to illustrate its application.)