Window Functions (ROW_NUMBER, RANK, LAG, LEAD)

Difficulty: Advanced
Generated on: 2025-07-10 02:29:28
Category: SQL Cheatsheet for Database Development

SQL Window Functions Cheatsheet (Advanced) - ROW_NUMBER, RANK, LAG, LEAD

1. Quick Overview

Window functions perform calculations across a set of table rows that are related to the current row. Unlike aggregate functions that collapse multiple rows into a single output row, window functions retain the individual rows. They allow you to calculate running totals, ranks, moving averages, and more within a result set.

When to Use:

Ranking rows within groups
Calculating differences between rows
Generating sequential numbers
Analyzing trends over time
Calculating cumulative sums or averages

2. Syntax

SELECT
    column1,
    column2,
    WINDOW_FUNCTION(argument1, argument2, ...) OVER (
        [PARTITION BY column_list]
        ORDER BY column_list [ASC | DESC]
        [ROWS | RANGE BETWEEN frame_start AND frame_end]
    ) AS alias
FROM
    table_name;

WINDOW_FUNCTION(): The window function to use (e.g., ROW_NUMBER(), RANK(), LAG(), LEAD()).
PARTITION BY: Divides the rows into partitions (groups) based on the specified column(s). The window function is applied separately to each partition. If omitted, the entire table is treated as a single partition.
ORDER BY: Defines the order of rows within each partition. Crucial for functions like ROW_NUMBER, RANK, LAG, and LEAD.
ROWS | RANGE BETWEEN frame_start AND frame_end (Optional): Defines the window frame (subset of rows within the partition) used for the calculation. If omitted, the default frame is all rows from the beginning of the partition to the current row.
- frame_start and frame_end can be:
  - UNBOUNDED PRECEDING
  - n PRECEDING (where n is an integer)
  - CURRENT ROW
  - n FOLLOWING (where n is an integer)
  - UNBOUNDED FOLLOWING

Example (Basic):

SELECT
    employee_id,
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS salary_rank
FROM
    employees;

3. Common Use Cases

Window Function	Use Case
`ROW_NUMBER()`	Assigning a unique sequential integer to each row within a partition.
`RANK()`	Assigning a rank to each row within a partition based on a specific value. Handles ties.
`LAG()`	Accessing data from a previous row within a partition.
`LEAD()`	Accessing data from a subsequent row within a partition.

Detailed Use Cases:

ROW_NUMBER():
- Pagination: Implementing numbered pages in search results.
- Deduplication: Selecting only the most recent record for each unique key.
- Ranking: Simple ranking without handling ties.
RANK():
- Leaderboards: Ranking players based on their score.
- Performance Analysis: Ranking employees based on sales figures.
- Identifying Top Performers: Selecting the top N performers in each category.
LAG():
- Calculating Differences: Finding the difference between the current value and the previous value (e.g., change in stock price).
- Comparing Values Over Time: Analyzing sales growth from one month to the next.
- Anomaly Detection: Identifying unusual spikes or dips in data.
LEAD():
- Forecasting: Using future values to predict trends.
- Scheduling: Determining the next scheduled event based on current data.
- Calculating Time to Next Event: Finding the time difference between the current event and the next event.

4. Best Practices

Use PARTITION BY judiciously: Partitions can significantly affect performance. Only partition when necessary.
Optimize ORDER BY: The ORDER BY clause is critical for window functions. Ensure the columns used in ORDER BY are indexed.
Window Frame Specification: Carefully define the window frame (using ROWS or RANGE) to avoid unnecessary calculations. The default frame (rows from beginning of partition to current row) might not always be what you want.
Avoid Nested Window Functions (if possible): Nested window functions can negatively impact performance. Consider alternative approaches if possible. While some databases support them, they can be hard to read and debug.
Understand Data Distribution: Uneven data distribution within partitions can affect performance. Consider pre-aggregating data if necessary.
Test Thoroughly: Window functions can be complex. Test your queries with various data sets to ensure they produce the expected results.
Use indexes: If your ORDER BY clause includes a column that is frequently used in queries, create an index on that column. This can significantly speed up the execution of window functions.

5. Examples

Sample Data ( sales table`):

CREATE TABLE sales (
    sale_id INT PRIMARY KEY,
    product_id INT,
    sale_date DATE,
    amount DECIMAL(10, 2),
    region VARCHAR(50)
);

INSERT INTO sales (sale_id, product_id, sale_date, amount, region) VALUES
(1, 101, '2023-01-15', 150.00, 'North'),
(2, 102, '2023-01-20', 200.00, 'North'),
(3, 101, '2023-02-10', 180.00, 'North'),
(4, 103, '2023-02-25', 250.00, 'North'),
(5, 101, '2023-01-05', 120.00, 'South'),
(6, 102, '2023-01-18', 180.00, 'South'),
(7, 103, '2023-02-01', 220.00, 'South'),
(8, 101, '2023-02-15', 200.00, 'South');

Example 1: ROW_NUMBER() - Assigning a rank to sales within each region based on amount.

SELECT
    sale_id,
    product_id,
    sale_date,
    amount,
    region,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS region_sale_rank
FROM
    sales;

Output:

sale_id	product_id	sale_date	amount	region	region_sale_rank
4	103	2023-02-25	250.00	North	1
2	102	2023-01-20	200.00	North	2
3	101	2023-02-10	180.00	North	3
1	101	2023-01-15	150.00	North	4
7	103	2023-02-01	220.00	South	1
8	101	2023-02-15	200.00	South	2
6	102	2023-01-18	180.00	South	3
5	101	2023-01-05	120.00	South	4

Example 2: RANK() - Ranking sales within each region based on amount, handling ties.

SELECT
    sale_id,
    product_id,
    sale_date,
    amount,
    region,
    RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS region_sale_rank
FROM
    sales;

(If two sales in the same region had the same amount, they would receive the same rank.)

Example 3: LAG() - Calculating the difference in sales amount from the previous sale within each region.

SELECT
    sale_id,
    product_id,
    sale_date,
    amount,
    region,
    LAG(amount, 1, 0) OVER (PARTITION BY region ORDER BY sale_date) AS previous_sale_amount,
    amount - LAG(amount, 1, 0) OVER (PARTITION BY region ORDER BY sale_date) AS amount_difference
FROM
    sales;

Output:

sale_id	product_id	sale_date	amount	region	previous_sale_amount	amount_difference
1	101	2023-01-15	150.00	North	0.00	150.00
2	102	2023-01-20	200.00	North	150.00	50.00
3	101	2023-02-10	180.00	North	200.00	-20.00
4	103	2023-02-25	250.00	North	180.00	70.00
5	101	2023-01-05	120.00	South	0.00	120.00
6	102	2023-01-18	180.00	South	120.00	60.00
7	103	2023-02-01	220.00	South	180.00	40.00
8	101	2023-02-15	200.00	South	220.00	-20.00

(Note: The LAG function’s third argument (0 in this case) is the default value if there’s no previous row.)

Example 4: LEAD() - Finding the next sale date for each product.

SELECT
    sale_id,
    product_id,
    sale_date,
    amount,
    region,
    LEAD(sale_date, 1, NULL) OVER (PARTITION BY product_id ORDER BY sale_date) AS next_sale_date
FROM
    sales;

Output:

sale_id	product_id	sale_date	amount	region	next_sale_date
5	101	2023-01-05	120.00	South	2023-01-15
1	101	2023-01-15	150.00	North	2023-02-10
3	101	2023-02-10	180.00	North	2023-02-15
8	101	2023-02-15	200.00	South	NULL
6	102	2023-01-18	180.00	South	2023-01-20
2	102	2023-01-20	200.00	North	NULL
7	103	2023-02-01	220.00	South	2023-02-25
4	103	2023-02-25	250.00	North	NULL

(Note: The LEAD function’s third argument (NULL in this case) is the default value if there’s no subsequent row.)

6. Common Pitfalls

Missing ORDER BY: For ROW_NUMBER(), RANK(), LAG(), and LEAD(), the ORDER BY clause is crucial. Omitting it (or using an incorrect order) will lead to unpredictable results.
Incorrect PARTITION BY: Incorrect partitioning will apply the window function across the wrong set of rows, leading to incorrect calculations.
Confusing ROWS and RANGE: ROWS specifies the frame in terms of the number of rows, while RANGE specifies the frame based on the value of the ORDER BY column. Choose the appropriate frame specification based on your requirements. RANGE must have exactly one ORDER BY expression.
Performance Issues: Window functions can be resource-intensive. Monitor query execution plans and optimize as needed. Improper use can lead to full table scans.
Incorrect Default Values for LAG() and LEAD(): If you don’t provide a default value, LAG() and LEAD() will return NULL for the first/last row in a partition. Consider whether NULL is the desired behavior.

Troubleshooting Tips:

Examine Execution Plans: Use your database’s query execution plan tool to identify performance bottlenecks.
Simplify the Query: Remove unnecessary columns and clauses to isolate the problem.
Test with Small Data Sets: Use a small subset of your data to test your window function logic.
Break Down the Query: Calculate intermediate results using subqueries to understand the behavior of each part of the query.

7. Database Variations

While the core syntax is generally consistent, there are some database-specific variations:

MySQL: MySQL 8.0+ fully supports window functions. Older versions have limited support.
PostgreSQL: PostgreSQL has robust window function support, including advanced features like custom window functions.
SQL Server: SQL Server supports window functions from SQL Server 2005 onwards.
Oracle: Oracle has comprehensive window function support, including analytic functions that are similar to window functions.

Specific Differences:

Syntax: Minor syntax variations may exist, such as the placement of the OVER() clause.
Function Availability: Some databases may not support all window functions.
Performance: Performance characteristics can vary significantly across databases.
Frame Specification: The syntax for specifying the window frame (using ROWS or RANGE) may differ slightly.

Always consult the documentation for your specific database version for the most accurate and up-to-date information. Pay attention to database-specific performance tuning recommendations for window functions.