Bucket Sort

bucket-sort: The Ultimate Cheatsheet

1. Detailed Explanation

Bucket sort is a distribution sort that works by distributing the elements of an array into a number of buckets. Each bucket is then sorted individually, either using a different sorting algorithm or recursively applying the bucket sort algorithm. Finally, the sorted buckets are concatenated to produce the sorted output.

Why is it important? Bucket sort excels when the input data is uniformly distributed. It offers a time complexity that’s significantly better than comparison-based sorts like merge sort or quicksort in these scenarios, making it a powerful tool for specific problem types.

Core Concepts:

Buckets: Containers that hold a subset of the input data. The number of buckets is a crucial parameter affecting performance.
Uniform Distribution: The effectiveness of bucket sort hinges on the input data being (approximately) uniformly distributed across the range of possible values.
Sub-sorting: Each bucket is typically sorted using a simpler algorithm like insertion sort (efficient for small, nearly sorted lists) or recursively using bucket sort itself.

2. When to Use bucket-sort (and When Not To)

Use Bucket Sort when:

The input data is uniformly or nearly uniformly distributed across a known range.
The number of buckets is chosen appropriately (not too few, not too many).
You need a linear-time sorting algorithm for specific data distributions.

Don’t Use Bucket Sort when:

The input data is heavily skewed or clustered. Performance degrades significantly in this case.
The range of input values is unknown or extremely large.
The cost of creating and managing buckets outweighs the benefits. This is especially true for small datasets.

3. Core Algorithm / Data Structure Template

Initialize Buckets: Create an array of empty buckets (lists or arrays). The number of buckets should be chosen based on the range of input values. A good starting point is the square root of the number of elements.
Distribute Elements: Iterate through the input array. For each element, determine which bucket it belongs to based on its value and place it in that bucket.
Sort Buckets: Sort each bucket individually using a suitable sorting algorithm (e.g., insertion sort).
Concatenate Buckets: Concatenate the sorted buckets to produce the final sorted array.

4. Code Implementations

Python

def bucket_sort(arr):
    if not arr:
        return arr

    min_val = min(arr)
    max_val = max(arr)
    num_buckets = max(1, int((max_val - min_val + 1)**0.5)) # Adjust bucket number as needed

    buckets = [[] for _ in range(num_buckets)]
    bucket_range = (max_val - min_val + 1) / num_buckets

    for num in arr:
        index = int((num - min_val) // bucket_range)
        buckets[index].append(num)

    for i in range(num_buckets):
        buckets[i].sort() # Use insertion sort or another suitable algorithm

    sorted_arr = []
    for bucket in buckets:
        sorted_arr.extend(bucket)

    return sorted_arr

Java

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BucketSort {
    public static List<Integer> bucketSort(List<Integer> arr) {
        if (arr == null || arr.isEmpty()) {
            return arr;
        }

        int minVal = Collections.min(arr);
        int maxVal = Collections.max(arr);
        int numBuckets = Math.max(1, (int) Math.sqrt(arr.size()));

        List<List<Integer>> buckets = new ArrayList<>(numBuckets);
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }

        double bucketRange = (double) (maxVal - minVal + 1) / numBuckets;

        for (int num : arr) {
            int index = (int) ((num - minVal) / bucketRange);
            buckets.get(index).add(num);
        }

        for (List<Integer> bucket : buckets) {
            Collections.sort(bucket); // Use Collections.sort or another suitable method
        }

        List<Integer> sortedArr = new ArrayList<>();
        for (List<Integer> bucket : buckets) {
            sortedArr.addAll(bucket);
        }

        return sortedArr;
    }

    public static void main(String[] args) {
        List<Integer> arr = Arrays.asList(5, 2, 9, 1, 5, 6);
        List<Integer> sortedArr = bucketSort(arr);
        System.out.println(sortedArr); // Output: [1, 2, 5, 5, 6, 9]
    }
}

C++

#include <iostream>
#include <vector>
#include <algorithm>
#include <cmath>

using namespace std;

vector<int> bucketSort(vector<int>& arr) {
    if (arr.empty()) {
        return arr;
    }

    int minVal = *min_element(arr.begin(), arr.end());
    int maxVal = *max_element(arr.begin(), arr.end());
    int numBuckets = max(1, (int)sqrt(arr.size()));

    vector<vector<int>> buckets(numBuckets);
    double bucketRange = (double)(maxVal - minVal + 1) / numBuckets;

    for (int num : arr) {
        int index = (int)((num - minVal) / bucketRange);
        buckets[index].push_back(num);
    }

    for (auto& bucket : buckets) {
        sort(bucket.begin(), bucket.end()); // Use std::sort
    }

    vector<int> sortedArr;
    for (const auto& bucket : buckets) {
        sortedArr.insert(sortedArr.end(), bucket.begin(), bucket.end());
    }

    return sortedArr;
}

int main() {
    vector<int> arr = {5, 2, 9, 1, 5, 6};
    vector<int> sortedArr = bucketSort(arr);
    for (int num : sortedArr) {
        cout << num << " "; // Output: 1 2 5 5 6 9
    }
    cout << endl;
    return 0;
}

5. Complexity Analysis

Case	Time Complexity	Space Complexity
Best	O(n + k)	O(n + k)
Average	O(n + k)	O(n + k)
Worst	O(n^2)	O(n + k)

Where:

n is the number of elements in the input array.
k is the number of buckets. In a well-behaved scenario where k is approximately the square root of n, the complexity becomes O(n). However, in the worst-case scenario (all elements in one bucket), we end up sorting a list of size n, resulting in O(n^2) complexity if a simple sorting algorithm like insertion sort is used within each bucket.

6. Pro Tips, Tricks, and Common Pitfalls

Bucket Size: The choice of the number of buckets significantly impacts performance. Too few buckets lead to large buckets and inefficient sorting within them. Too many buckets increase overhead. Experiment to find the optimal value for your data. The square root of the input size is often a good starting point.
Sub-sorting Algorithm: The algorithm used to sort individual buckets matters. Insertion sort is a good choice for smaller buckets due to its efficiency on nearly sorted data. For larger buckets, consider more sophisticated algorithms.
Data Distribution: Bucket sort’s efficiency is directly tied to the uniformity of the input data. If your data is highly skewed, consider alternative sorting algorithms.
Handling Floating-Point Numbers: When dealing with floating-point numbers, you’ll need to handle floating-point precision carefully when determining bucket indices to avoid unexpected behavior.

7. Classic Problem Examples

Example: Maximum Gap

Description: Given an integer array nums, return the maximum difference between two successive elements in its sorted form. If the array contains less than two elements, return 0. You must write an algorithm that runs in linear time and uses linear extra space.

High-Level Approach using Bucket Sort:

Find the minimum and maximum values in the array.
Create buckets based on the range of values (similar to the bucket sort implementation above). The number of buckets should be chosen carefully; a good rule of thumb is min(n-1, (max - min) / range) where range is a sensible value that prevents too many buckets (e.g., 1 or a small constant).
Distribute the elements into the buckets. If a bucket is empty, record that.
Iterate through the buckets. The maximum gap will occur between the maximum value in one bucket and the minimum value in the next non-empty bucket. Track the maximum gap found.
Return the maximum gap. If there is only one non-empty bucket, return 0.

This approach leverages the bucket sort idea to distribute the elements efficiently. The actual sorting within the buckets is not strictly necessary because we only need the minimum and maximum values from each non-empty bucket to calculate the maximum gap. The linear time complexity comes from the fact that we are only iterating once through the input and once through the buckets. The space complexity is linear because the number of buckets is linear to the range of the input.