Data Warehousing Concepts (OLAP, Star Schema)

Difficulty: Advanced
Generated on: 2025-07-10 02:36:33
Category: SQL Cheatsheet for Database Development

Data Warehousing & OLAP SQL Cheatsheet (Advanced)

1. Quick Overview

Data Warehousing: A system used for reporting and data analysis, and is considered a core component of business intelligence. Data warehouses are central repositories of integrated data from one or more disparate sources. They store historical data for analytical purposes.

OLAP (Online Analytical Processing): A category of software tools that provide analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multi-dimensional data.

Star Schema: A data warehouse schema with a central fact table surrounded by dimension tables. It’s designed for fast query performance and ease of understanding.

When to Use:

Data Warehousing: When you need to analyze historical data from multiple sources to make informed business decisions.
OLAP: When you need to perform complex queries that analyze data across multiple dimensions (e.g., sales by product, region, and time).
Star Schema: When you need a simple, efficient schema for data warehousing that supports fast query performance for OLAP operations.

2. Syntax

Creating a Fact Table (Sales Example):

CREATE TABLE fact_sales (
    sales_key INT PRIMARY KEY,
    date_key INT,
    product_key INT,
    customer_key INT,
    sales_amount DECIMAL(10, 2),
    quantity_sold INT,
    -- Add more measures as needed
    FOREIGN KEY (date_key) REFERENCES dim_date(date_key),
    FOREIGN KEY (product_key) REFERENCES dim_product(product_key),
    FOREIGN KEY (customer_key) REFERENCES dim_customer(customer_key)
);

Creating a Dimension Table (Date Example):

CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,
    date DATE,
    year INT,
    quarter INT,
    month INT,
    day INT,
    day_of_week INT,
    -- Add more descriptive attributes as needed
    is_holiday BOOLEAN
);

Populating Dimension Tables:

-- Example: Populating the dim_date table
INSERT INTO dim_date (date_key, date, year, quarter, month, day, day_of_week, is_holiday)
VALUES (20230101, '2023-01-01', 2023, 1, 1, 1, 7, FALSE),
       (20230102, '2023-01-02', 2023, 1, 1, 2, 1, TRUE), -- Example Holiday
       (20230103, '2023-01-03', 2023, 1, 1, 3, 2, FALSE);

-- Example: Populating the dim_product table
INSERT INTO dim_product (product_key, product_name, category, price)
VALUES (1, 'Laptop', 'Electronics', 1200.00),
       (2, 'Keyboard', 'Electronics', 75.00);

-- Example: Populating the dim_customer table
INSERT INTO dim_customer (customer_key, customer_name, city, state)
VALUES (1, 'John Doe', 'New York', 'NY'),
       (2, 'Jane Smith', 'Los Angeles', 'CA');

-- Example: Populating the fact_sales table
INSERT INTO fact_sales (sales_key, date_key, product_key, customer_key, sales_amount, quantity_sold)
VALUES (1, 20230101, 1, 1, 1200.00, 1),
       (2, 20230101, 2, 2, 75.00, 1);

Basic OLAP Query (Sales by Product and Month):

SELECT
    d.year,
    d.month,
    p.product_name,
    SUM(f.sales_amount) AS total_sales
FROM
    fact_sales f
JOIN
    dim_date d ON f.date_key = d.date_key
JOIN
    dim_product p ON f.product_key = p.product_key
GROUP BY
    d.year,
    d.month,
    p.product_name
ORDER BY
    d.year,
    d.month,
    total_sales DESC;

Window Functions (Calculating Rolling Average):

SELECT
    d.date,
    SUM(f.sales_amount) AS daily_sales,
    AVG(SUM(f.sales_amount)) OVER (ORDER BY d.date ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7_day_average
FROM
    fact_sales f
JOIN
    dim_date d ON f.date_key = d.date_key
GROUP BY
    d.date
ORDER BY
    d.date;

CUBE and ROLLUP (Calculating Subtotals and Grand Totals):

-- ROLLUP:  Generates subtotals along a hierarchy (year -> quarter -> month)
SELECT
    d.year,
    d.quarter,
    d.month,
    SUM(f.sales_amount) AS total_sales
FROM
    fact_sales f
JOIN
    dim_date d ON f.date_key = d.date_key
GROUP BY ROLLUP (d.year, d.quarter, d.month)
ORDER BY d.year, d.quarter, d.month;

-- CUBE: Generates subtotals for all possible combinations of dimensions.  Can be computationally expensive.
SELECT
    d.year,
    d.quarter,
    p.category,
    SUM(f.sales_amount) AS total_sales
FROM
    fact_sales f
JOIN
    dim_date d ON f.date_key = d.date_key
JOIN
    dim_product p ON f.product_key = p.product_key
GROUP BY CUBE (d.year, d.quarter, p.category)
ORDER BY d.year, d.quarter, p.category;

3. Common Use Cases

Sales Analysis: Analyze sales trends by product, region, customer segment, and time period.
Inventory Management: Track inventory levels and optimize stock levels based on demand patterns.
Customer Segmentation: Identify different customer segments based on their purchasing behavior.
Financial Reporting: Generate financial reports for key performance indicators (KPIs).
Marketing Campaign Analysis: Evaluate the effectiveness of marketing campaigns.
Fraud Detection: Identify suspicious transactions and patterns.
Web Analytics: Track website traffic, user behavior, and conversion rates.

4. Best Practices

Use Surrogate Keys: Use integer surrogate keys for dimension tables instead of natural keys (e.g., product name). This improves performance and allows for changes in natural keys without affecting the fact table.
Proper Indexing: Index foreign key columns in the fact table and primary key columns in dimension tables. Consider indexing frequently used columns in dimension tables for filtering.
Partitioning: Partition large fact tables based on date or other relevant dimensions to improve query performance.
Data Compression: Use data compression to reduce storage space and improve I/O performance.
Materialized Views: Create materialized views for frequently used aggregations to avoid recalculating them every time. (Database specific - see below).
Avoid Complex Joins: Minimize the number of joins in your queries. Star schemas are designed to minimize joins.
Use Appropriate Data Types: Choose the correct data types for your columns. For example, use INT for integer values, DECIMAL for currency values, and DATE or TIMESTAMP for date and time values.
Regularly Update Statistics: Update database statistics to help the query optimizer choose the best execution plan.
Monitor Query Performance: Use query profiling tools to identify slow-running queries and optimize them.
Incremental Loading: Implement an incremental loading strategy to update the data warehouse with new data without reloading the entire dataset.
Data Quality: Ensure data quality through data validation and cleansing processes. Bad data in = bad results out.
Security: Implement proper security measures to protect sensitive data. Use role-based access control (RBAC) to restrict access to data based on user roles. Encrypt sensitive data at rest and in transit.

5. Examples

Example 1: Sales by Region and Product Category

SELECT
    c.state,
    p.category,
    SUM(f.sales_amount) AS total_sales
FROM
    fact_sales f
JOIN
    dim_customer c ON f.customer_key = c.customer_key
JOIN
    dim_product p ON f.product_key = p.product_key
GROUP BY
    c.state,
    p.category
ORDER BY
    c.state,
    total_sales DESC;

-- Sample Output:
-- state | category      | total_sales
-- ------|---------------|-------------
-- CA    | Electronics   | 15000.00
-- CA    | Books         | 5000.00
-- NY    | Electronics   | 12000.00
-- NY    | Clothing      | 3000.00

Example 2: Top 5 Customers by Sales Amount

SELECT
    c.customer_name,
    SUM(f.sales_amount) AS total_sales
FROM
    fact_sales f
JOIN
    dim_customer c ON f.customer_key = c.customer_key
GROUP BY
    c.customer_name
ORDER BY
    total_sales DESC
LIMIT 5;

-- Sample Output:
-- customer_name | total_sales
-- ---------------|-------------
-- John Doe      | 20000.00
-- Jane Smith      | 15000.00
-- David Lee     | 10000.00
-- Sarah Jones   | 8000.00
-- Michael Brown | 5000.00

Example 3: Year-over-Year Sales Growth

SELECT
    d.year,
    SUM(f.sales_amount) AS current_year_sales,
    LAG(SUM(f.sales_amount), 1, 0) OVER (ORDER BY d.year) AS previous_year_sales,
    (SUM(f.sales_amount) - LAG(SUM(f.sales_amount), 1, 0) OVER (ORDER BY d.year)) / LAG(SUM(f.sales_amount), 1, 0) OVER (ORDER BY d.year) AS year_over_year_growth
FROM
    fact_sales f
JOIN
    dim_date d ON f.date_key = d.date_key
GROUP BY
    d.year
ORDER BY
    d.year;

-- Sample Output:
-- year | current_year_sales | previous_year_sales | year_over_year_growth
-- -----|----------------------|-----------------------|-----------------------
-- 2022 | 100000.00            | 0.00                  | NULL
-- 2023 | 120000.00            | 100000.00             | 0.20
-- 2024 | 150000.00            | 120000.00             | 0.25

Example 4: Calculating Market Share by Product Category

WITH CategorySales AS (
    SELECT
        p.category,
        SUM(f.sales_amount) AS category_sales
    FROM
        fact_sales f
    JOIN
        dim_product p ON f.product_key = p.product_key
    GROUP BY
        p.category
),
TotalSales AS (
    SELECT SUM(category_sales) AS total_market_sales FROM CategorySales
)
SELECT
    cs.category,
    cs.category_sales,
    (cs.category_sales / ts.total_market_sales) AS market_share
FROM
    CategorySales cs, TotalSales ts
ORDER BY
    market_share DESC;

-- Sample Output:
-- category      | category_sales | market_share
-- ---------------|----------------|--------------
-- Electronics   | 27000.00       | 0.54
-- Clothing      | 15000.00       | 0.30
-- Books         | 8000.00        | 0.16

6. Common Pitfalls

Incorrect Data Modeling: Poorly designed dimension tables or fact tables can lead to slow query performance and inaccurate results. Spend time on proper dimensional modeling.
Missing Indexes: Missing indexes can significantly slow down query performance.
Over-Normalization: Over-normalizing dimension tables can increase the number of joins required, which can negatively impact performance. Consider denormalization where appropriate.
Ignoring Data Skew: Data skew can lead to uneven data distribution across partitions, which can negatively impact query performance. Consider using techniques like salting to address data skew.
Not Monitoring Query Performance: Failing to monitor query performance can lead to undetected performance issues.
Assuming Data is Clean: Always validate and cleanse data before loading it into the data warehouse.
Forgetting to Update Statistics: Stale statistics can lead to the query optimizer choosing suboptimal execution plans.
Using SELECT *: Avoid using SELECT * in your queries. Specify the columns you need to retrieve to improve performance.

Troubleshooting Tips:

Use EXPLAIN: Use the EXPLAIN statement to analyze the query execution plan and identify performance bottlenecks.
Check for Missing Indexes: Look for missing indexes on foreign key columns and frequently used filtering columns.
Analyze Query Performance: Use query profiling tools to identify slow-running queries and identify the areas that are consuming the most resources.
Review Data Model: Review the data model to ensure that it is properly designed and optimized for analytical queries.
Check Data Quality: Check the data quality to ensure that there are no data inconsistencies or errors.

7. Database Variations

Here’s how some of the concepts vary across different database systems:

Feature	MySQL	PostgreSQL	SQL Server	Oracle
Materialized Views	Available through 3rd party tools or triggers.	Native support. `CREATE MATERIALIZED VIEW`	Native support. `CREATE VIEW WITH SCHEMABINDING`	Native support. `CREATE MATERIALIZED VIEW`
Partitioning	Supports partitioning (horizontal).	Supports partitioning (declarative).	Supports partitioning (table and index).	Supports partitioning (table and index).
CUBE/ROLLUP	Supported	Supported	Supported	Supported
Window Functions	Supported	Supported	Supported	Supported
Data Compression	Available through storage engine features.	Available through extensions.	Native support.	Native support.
Statistics Updates	`ANALYZE TABLE`	`ANALYZE`	`UPDATE STATISTICS`	`DBMS_STATS.GATHER_TABLE_STATS`

Specific Examples of Database Differences:

MySQL Materialized Views: MySQL does not have native materialized views. You can simulate them using views and triggers or using third-party tools.
SQL Server Clustered Columnstore Indexes: SQL Server has Clustered Columnstore Indexes that are optimized for data warehousing workloads. These indexes store data in a columnar format, which can significantly improve query performance for analytical queries.
Oracle Partitioning: Oracle offers advanced partitioning features, including range partitioning, list partitioning, hash partitioning, and composite partitioning.
PostgreSQL Extensions: PostgreSQL has a rich ecosystem of extensions that can be used to enhance its data warehousing capabilities. For example, the pg_column_size extension can be used to estimate the size of columns in a table, which can be helpful for optimizing storage space.

This cheatsheet provides a comprehensive overview of data warehousing concepts, including OLAP and star schemas. It includes practical SQL examples, best practices, common pitfalls, and database variations to help you design and implement efficient and effective data warehousing solutions. Remember to tailor your approach to the specific needs of your project and the capabilities of your chosen database system.