Partitioning and Sharding

Difficulty: Advanced
Generated on: 2025-07-10 02:34:19
Category: SQL Cheatsheet for Database Development

SQL Partitioning & Sharding Cheatsheet (Advanced)

1. Quick Overview

Partitioning: Dividing a single logical table into smaller, more manageable physical pieces within a single database instance. Improves query performance, manageability, and storage efficiency. All partitions reside on the same database server.

Sharding: Dividing a single logical database into multiple physical databases (shards), typically distributed across multiple servers. Improves scalability, availability, and performance by distributing the workload. Data is split across different database servers.

When to Use:

Partitioning:
- Large tables with frequent queries on specific subsets of data (e.g., time-series data).
- Archiving old data without impacting query performance on recent data.
- Improving query performance by only scanning relevant partitions.
- Easier data management (e.g., backup, restore, reindexing).
Sharding:
- Database is too large to fit on a single server.
- High write load that a single server cannot handle.
- Need to scale the database horizontally.
- Geographical distribution of users (geo-sharding).

2. Syntax

Partitioning (MySQL Example):

CREATE TABLE sales (
    sale_id INT PRIMARY KEY,
    sale_date DATE NOT NULL,
    product_id INT,
    amount DECIMAL(10, 2)
)
PARTITION BY RANGE ( YEAR(sale_date) ) (
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION pFuture VALUES LESS THAN MAXVALUE
);

-- Add a new partition
ALTER TABLE sales ADD PARTITION (PARTITION p2023 VALUES LESS THAN (2024));

-- Drop a partition
ALTER TABLE sales DROP PARTITION p2020;

Partitioning (PostgreSQL Example - Declarative Partitioning):

CREATE TABLE sales (
    sale_id INT PRIMARY KEY,
    sale_date DATE NOT NULL,
    product_id INT,
    amount DECIMAL(10, 2)
) PARTITION BY RANGE (sale_date);

CREATE TABLE sales_y2020 PARTITION OF sales FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');
CREATE TABLE sales_y2021 PARTITION OF sales FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');
CREATE TABLE sales_y2022 PARTITION OF sales FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
CREATE TABLE sales_future PARTITION OF sales FOR VALUES FROM ('2023-01-01') TO (MAXVALUE);

-- Create an index on each partition.  Very important for performance!
CREATE INDEX idx_sales_y2020 ON sales_y2020 (sale_date);
CREATE INDEX idx_sales_y2021 ON sales_y2021 (sale_date);
CREATE INDEX idx_sales_y2022 ON sales_y2022 (sale_date);
CREATE INDEX idx_sales_future ON sales_future (sale_date);

Sharding (Conceptual - No Standard SQL Syntax):

Sharding requires application-level logic or middleware to route queries to the correct shard. There is no built-in SQL syntax for sharding itself. The following is a conceptual example:

Define Shard Key: user_id (for example)
Shard Function: shard_id = user_id % number_of_shards
Application Logic:
- When a query comes in for a specific user_id, calculate the shard_id.
- Route the query to the database server corresponding to that shard_id.

Key Concepts:

Partition Key: The column(s) used to determine which partition a row belongs to.
Partition Function: Defines how rows are assigned to partitions (e.g., RANGE, LIST, HASH).
Shard Key: The column(s) used to determine which shard a row belongs to.
Shard Function: Defines how rows are assigned to shards (e.g., consistent hashing, range-based).

3. Common Use Cases

Time-Series Data: Partitioning or sharding by date or time ranges (e.g., logs, sensor data, financial transactions).
User Data: Sharding by user ID to distribute the load of user-specific queries.
Geographic Data: Sharding by region to improve performance for users in specific locations.
Event Data: Partitioning or sharding by event type or timestamp.
Archiving: Moving old data to separate partitions for archival purposes.

Example: Time-Series Data (Partitioning)

Imagine a table storing website traffic data:

CREATE TABLE page_views (
    view_id BIGINT PRIMARY KEY,
    page_url VARCHAR(255) NOT NULL,
    view_timestamp TIMESTAMP NOT NULL,
    user_id INT
)
PARTITION BY RANGE ( EXTRACT(YEAR FROM view_timestamp) ) (
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pFuture VALUES LESS THAN MAXVALUE
);

Queries like SELECT * FROM page_views WHERE view_timestamp BETWEEN '2021-06-01' AND '2021-06-30' will only scan the p2021 partition, significantly improving performance.

Example: User Data (Sharding)

Assume you have a users table with billions of rows. You can shard it by user_id:

Shard Key: user_id
Number of Shards: 4
Shard Function: shard_id = user_id % 4

Users with user_id divisible by 4 go to shard 0.
Users with user_id having a remainder of 1 when divided by 4 go to shard 1, and so on.

Application logic routes queries based on the user_id to the correct shard.

4. Best Practices

Choose the Right Partition/Shard Key: Crucial for performance. Consider query patterns and data distribution. A poorly chosen key can lead to uneven data distribution and hotspots.
Ensure Even Data Distribution: Avoid skewed partitions or shards. Use a hash-based partition/shard function if necessary.
Index All Partitions/Shards: Indexes are essential for query performance. Create indexes on the partition/shard key and any other frequently queried columns.
Partition Pruning: Ensure the database can effectively eliminate irrelevant partitions during query execution. Use the partition key in your WHERE clauses.
Minimize Cross-Partition/Shard Queries: Cross-partition/shard queries can be slow. Design your schema and queries to minimize them. Consider denormalization if necessary.
Transaction Management: Sharding introduces complexities for transactions that span multiple shards. Consider using distributed transactions or eventual consistency.
Monitoring: Monitor partition/shard size, query performance, and resource utilization.
Backup and Recovery: Develop a backup and recovery strategy for partitioned or sharded databases.
Automate Partition Management: Use scripts or tools to automate the creation, deletion, and maintenance of partitions.
Connection Pooling: When using sharding, ensure proper connection pooling to each shard to avoid connection exhaustion.
Test Thoroughly: Test your partitioning/sharding strategy thoroughly with realistic data and query patterns.

5. Examples

Example 1: Range Partitioning (MySQL) - Sales Data

CREATE TABLE sales (
    sale_id INT PRIMARY KEY,
    sale_date DATE NOT NULL,
    product_id INT,
    amount DECIMAL(10, 2)
)
PARTITION BY RANGE ( YEAR(sale_date) ) (
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION pFuture VALUES LESS THAN MAXVALUE
);

INSERT INTO sales (sale_id, sale_date, product_id, amount) VALUES
(1, '2020-05-15', 101, 50.00),
(2, '2021-08-22', 102, 75.50),
(3, '2022-12-01', 103, 100.00),
(4, '2023-01-10', 101, 60.00);

-- Querying a specific partition
SELECT * FROM sales PARTITION (p2021) WHERE product_id = 102;

Example 2: List Partitioning (MySQL) - Customer Data by Region

CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(255) NOT NULL,
    region VARCHAR(50) NOT NULL
)
PARTITION BY LIST (region) (
    PARTITION p_north VALUES IN ('North', 'Northeast'),
    PARTITION p_south VALUES IN ('South', 'Southeast'),
    PARTITION p_west VALUES IN ('West', 'Northwest'),
    PARTITION p_central VALUES IN ('Central')
);

INSERT INTO customers (customer_id, customer_name, region) VALUES
(1, 'Alice Smith', 'North'),
(2, 'Bob Johnson', 'South'),
(3, 'Charlie Brown', 'West'),
(4, 'David Lee', 'Central');

SELECT * FROM customers PARTITION (p_south);

Example 3: Hash Partitioning (MySQL) - Even Data Distribution

CREATE TABLE logs (
    log_id BIGINT PRIMARY KEY,
    log_timestamp TIMESTAMP NOT NULL,
    message VARCHAR(255)
)
PARTITION BY HASH(log_id)
PARTITIONS 4; -- Creates 4 partitions

Example 4: PostgreSQL - Range Partitioning with Inheritance

CREATE TABLE measurement (
    city_id         int not null,
    logdate         date not null,
    peaktemp        int,
    unitsales       int
) PARTITION BY RANGE (logdate);

CREATE TABLE measurement_y2016m01 PARTITION OF measurement
    FOR VALUES FROM ('2016-01-01') TO ('2016-02-01');

CREATE TABLE measurement_y2016m02 PARTITION OF measurement
    FOR VALUES FROM ('2016-02-01') TO ('2016-03-01');

Sample Data (for the PostgreSQL example):

INSERT INTO measurement(city_id, logdate, peaktemp, unitsales) VALUES (1, '2016-01-05', 50, 20);
INSERT INTO measurement(city_id, logdate, peaktemp, unitsales) VALUES (2, '2016-02-15', 60, 30);

6. Common Pitfalls

Incorrect Partition/Shard Key: Leads to uneven data distribution and poor performance.
Missing Indexes: Significantly degrades query performance.
Cross-Partition/Shard Queries: Can be very slow, especially with large datasets.
Hotspots: Some partitions/shards receive significantly more traffic than others, leading to performance bottlenecks.
Complexity: Partitioning and sharding add complexity to database management and application development.
Data Consistency Issues: Sharding can make it more difficult to maintain data consistency, especially with distributed transactions.
Ignoring Partition Pruning: Queries not designed to take advantage of partition pruning will scan all partitions.
Insufficient Monitoring: Lack of monitoring can lead to undetected performance problems.
Schema Changes: Schema changes can be more difficult to implement in partitioned or sharded environments. Careful planning is essential.
Lack of Automation: Manual partition management is error-prone and time-consuming.

Troubleshooting Tips:

Explain Plans: Use EXPLAIN (PostgreSQL, MySQL) or similar tools to analyze query execution plans and identify performance bottlenecks.
Monitoring Tools: Use database monitoring tools to track partition/shard size, query performance, and resource utilization.
Logging: Log queries and execution times to identify slow queries.
Data Distribution Analysis: Analyze data distribution to identify skewed partitions/shards.

7. Database Variations

Feature	MySQL	PostgreSQL	SQL Server	Oracle
Partitioning	Native partitioning (RANGE, LIST, HASH, KEY)	Declarative partitioning (RANGE, LIST, HASH)	Partitioned tables and indexes (RANGE)	Partitioned tables and indexes (RANGE, LIST, HASH)
Syntax	`PARTITION BY ...`	`PARTITION BY ...`, `CREATE TABLE ... PARTITION OF ...`	`CREATE PARTITION FUNCTION`, `CREATE PARTITION SCHEME`	`CREATE TABLE ... PARTITION BY ...`
Subpartitioning	Supported	Supported	Not natively supported, requires workarounds	Supported
Online Operations	Limited online operations	More flexible online operations	Limited online operations	More flexible online operations
Sharding	No native sharding support	No native sharding support	No native sharding support	No native sharding support
Sharding Implementation	Requires application logic or middleware	Requires application logic or middleware	Requires application logic or middleware	Requires application logic or middleware

Specific Notes:

MySQL: Offers a wide range of partitioning options. Partition pruning is generally well-optimized.
PostgreSQL: Declarative partitioning provides a more flexible and maintainable approach. Foreign key constraints across partitions can be tricky. Requires more explicit index creation on partitions.
SQL Server: Uses partition functions and schemes to define partitioning. Partition switching can be used for efficient data loading and archiving.
Oracle: Offers advanced partitioning features, including interval partitioning and virtual column partitioning. Online operations are generally more flexible than in MySQL or SQL Server.

Important Considerations:

All major databases offer the ability to partition tables. The specific syntax and features vary.
Sharding is typically implemented at the application level or using middleware. There is no standard SQL syntax for sharding.
Careful planning and testing are essential for successful partitioning and sharding strategies.
Always consult the database documentation for the latest information and best practices.