Database Normalization and Design

Difficulty: Advanced
Generated on: 2025-07-10 02:33:24
Category: SQL Cheatsheet for Database Development

SQL Cheatsheet: Database Normalization and Design (Advanced)

This cheatsheet provides a comprehensive overview of advanced database normalization and design principles, focusing on practical SQL examples and best practices for database developers.

1. Quick Overview

What is it? Database normalization is the process of organizing data in a database to reduce redundancy and improve data integrity. Database design encompasses the overall structure and organization of a database, considering factors like data relationships, performance, and scalability.
When to use it? Always! Normalization is fundamental to good database design. Aim for at least 3NF. Consider higher normal forms for complex scenarios. Database design principles are applied during initial database creation, schema modifications, and performance tuning.

2. Syntax

Creating Tables:

CREATE TABLE table_name (
    column1 datatype constraints,
    column2 datatype constraints,
    ...,
    CONSTRAINT pk_table_name PRIMARY KEY (column1), -- Primary Key
    CONSTRAINT fk_table_name FOREIGN KEY (column2) REFERENCES other_table(other_column) -- Foreign Key
);

Adding Constraints:

ALTER TABLE table_name
ADD CONSTRAINT constraint_name constraint_type (column_name);

-- Example: Adding a unique constraint
ALTER TABLE customers
ADD CONSTRAINT unique_email UNIQUE (email);

Indexes:

CREATE INDEX index_name
ON table_name (column1, column2);

-- Unique Index (enforces uniqueness and speeds up lookups)
CREATE UNIQUE INDEX unique_email_idx
ON customers (email);

-- Composite Index (improves performance on queries using multiple columns)
CREATE INDEX idx_order_customer_date
ON orders (customer_id, order_date);

3. Common Use Cases

Removing Redundancy: Storing customer address information only once, in a separate addresses table, and linking it to the customers table via a foreign key.
Ensuring Data Integrity: Using foreign keys to enforce relationships between tables, preventing orphaned records. For example, an order record must always have a valid customer_id.
Improving Query Performance: Creating indexes on frequently queried columns to speed up data retrieval.
Simplifying Data Updates: Modifying customer address information in one location (addresses table) ensures consistency across the entire database.
Implementing Auditing: Adding audit columns (e.g., created_at, updated_at, created_by, updated_by) to track changes to data.
Handling Many-to-Many Relationships: Using a junction table to resolve many-to-many relationships between two entities (e.g., students and courses using a student_courses table).
Implementing Versioning: Adding start and end dates to a table to track changes over time (e.g., for product pricing or employee salaries).

4. Best Practices

Normalization Levels:
- 1NF (First Normal Form): Eliminate repeating groups of data. Each column should contain atomic values (indivisible).
- 2NF (Second Normal Form): Be in 1NF and eliminate redundant data. Non-key attributes must be fully functionally dependent on the entire primary key.
- 3NF (Third Normal Form): Be in 2NF and eliminate transitive dependency. Non-key attributes must not depend on other non-key attributes.
- BCNF (Boyce-Codd Normal Form): A stricter version of 3NF. Every determinant must be a candidate key.
- 4NF (Fourth Normal Form): Eliminate multi-valued dependencies.
- 5NF (Fifth Normal Form): Eliminate join dependencies.
Denormalization: Occasionally, denormalization (intentionally introducing redundancy) can improve read performance in specific scenarios, but it should be done cautiously and with careful consideration of the trade-offs.
Data Types: Choose appropriate data types for each column to minimize storage space and improve performance. Use VARCHAR instead of TEXT if the maximum length is known. Use INT instead of BIGINT if the range of values is small enough.
Indexing: Create indexes on frequently queried columns, but avoid over-indexing, as it can slow down write operations. Regularly review and optimize indexes.
Foreign Keys: Always use foreign keys to enforce relationships between tables and maintain data integrity. Consider using ON DELETE CASCADE or ON DELETE SET NULL to handle orphaned records when parent records are deleted (with caution).
Constraints: Use constraints (e.g., NOT NULL, UNIQUE, CHECK) to enforce data validation rules.
Naming Conventions: Use consistent naming conventions for tables, columns, and constraints to improve readability and maintainability. Examples: customers, customer_id, fk_orders_customer.
Stored Procedures and Functions: Use stored procedures and functions to encapsulate complex logic and improve code reusability.
Views: Use views to simplify complex queries and provide a virtual representation of data.
Data Partitioning: For large tables, consider data partitioning to improve query performance and manageability.
Query Optimization: Use EXPLAIN (or equivalent) to analyze query execution plans and identify potential performance bottlenecks. Rewrite queries to improve performance.
Connection Pooling: Use connection pooling to reduce the overhead of establishing database connections.
Transaction Management: Use transactions to ensure data consistency and atomicity.
Security: Implement proper security measures to protect sensitive data, including access control, encryption, and data masking. Avoid storing sensitive data in plain text. Use parameterized queries to prevent SQL injection attacks.

5. Examples

Example 1: Normalizing Customer and Address Information

Unnormalized Table (Customers):

CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    Name VARCHAR(255),
    Address VARCHAR(255),
    City VARCHAR(255),
    State VARCHAR(255),
    ZipCode VARCHAR(10)
);

Normalized Tables (Customers and Addresses):

CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    Name VARCHAR(255),
    AddressID INT,
    FOREIGN KEY (AddressID) REFERENCES Addresses(AddressID)
);

CREATE TABLE Addresses (
    AddressID INT PRIMARY KEY AUTO_INCREMENT, -- Add AUTO_INCREMENT for new databases
    Address VARCHAR(255),
    City VARCHAR(255),
    State VARCHAR(255),
    ZipCode VARCHAR(10)
);

Sample Data:

-- Customers Table
INSERT INTO Customers (CustomerID, Name, AddressID) VALUES
(1, 'John Doe', 1),
(2, 'Jane Smith', 2);

-- Addresses Table
INSERT INTO Addresses (AddressID, Address, City, State, ZipCode) VALUES
(1, '123 Main St', 'Anytown', 'CA', '91234'),
(2, '456 Oak Ave', 'Springfield', 'IL', '62704');

Benefits: Reduces redundancy, simplifies address updates, and allows multiple customers to share the same address.

Example 2: Resolving a Many-to-Many Relationship (Students and Courses)

CREATE TABLE Students (
    StudentID INT PRIMARY KEY,
    Name VARCHAR(255)
);

CREATE TABLE Courses (
    CourseID INT PRIMARY KEY,
    CourseName VARCHAR(255)
);

CREATE TABLE StudentCourses (
    StudentID INT,
    CourseID INT,
    PRIMARY KEY (StudentID, CourseID),
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID),
    FOREIGN KEY (CourseID) REFERENCES Courses(CourseID)
);

Sample Data:

INSERT INTO Students (StudentID, Name) VALUES
(1, 'Alice'),
(2, 'Bob');

INSERT INTO Courses (CourseID, CourseName) VALUES
(101, 'Math'),
(102, 'Science');

INSERT INTO StudentCourses (StudentID, CourseID) VALUES
(1, 101), -- Alice takes Math
(1, 102), -- Alice takes Science
(2, 102);  -- Bob takes Science

Example 3: Implementing Auditing

CREATE TABLE Products (
    ProductID INT PRIMARY KEY,
    ProductName VARCHAR(255),
    Price DECIMAL(10, 2),
    CreatedAt DATETIME DEFAULT CURRENT_TIMESTAMP,
    UpdatedAt DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    CreatedBy VARCHAR(255),
    UpdatedBy VARCHAR(255)
);

Explanation: CreatedAt and CreatedBy store the creation timestamp and user. UpdatedAt automatically updates the timestamp on any update, and UpdatedBy stores the user who updated the record. CURRENT_TIMESTAMP is database-specific. MySQL uses CURRENT_TIMESTAMP, PostgreSQL uses now(), SQL Server uses GETDATE().

Example 4: Using CHECK Constraints

CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    Salary DECIMAL(10, 2),
    CONSTRAINT chk_salary CHECK (Salary >= 0)  -- Salary must be non-negative
);

Example 5: Versioning using Start and End Dates

CREATE TABLE ProductPrices (
    ProductID INT,
    Price DECIMAL(10, 2),
    StartDate DATE,
    EndDate DATE,
    PRIMARY KEY (ProductID, StartDate),
    FOREIGN KEY (ProductID) REFERENCES Products(ProductID)
);

Querying for current price:

SELECT ProductID, Price
FROM ProductPrices
WHERE ProductID = 123
AND StartDate <= CURRENT_DATE
AND EndDate >= CURRENT_DATE;

6. Common Pitfalls

Over-Normalization: Normalizing too aggressively can lead to complex queries and performance issues.
Under-Normalization: Not normalizing enough can lead to data redundancy, inconsistencies, and update anomalies.
Ignoring Data Types: Using inappropriate data types can waste storage space and negatively impact performance.
Over-Indexing: Creating too many indexes can slow down write operations.
Ignoring Null Values: Not handling null values properly can lead to unexpected query results. Consider the impact of nulls on indexes.
SQL Injection: Failing to sanitize user input can make your database vulnerable to SQL injection attacks. Always use parameterized queries or prepared statements.
Lack of Transactions: Not using transactions can lead to data inconsistencies in the event of errors.
Poorly Designed Primary Keys: Choosing a poor primary key can negatively impact performance. Use surrogate keys (e.g., auto-incrementing integers) when appropriate.
Not Considering Scalability: Designing a database without considering scalability can lead to performance bottlenecks as the data grows.
Forgetting Backup and Recovery: Not having a proper backup and recovery plan can lead to data loss in the event of a disaster.

7. Database Variations

Feature	MySQL	PostgreSQL	SQL Server	Oracle
Auto Increment	`AUTO_INCREMENT`	`SERIAL` or `GENERATED ALWAYS AS IDENTITY`	`IDENTITY(1,1)`	`SEQUENCE` and `TRIGGER`
Current Timestamp	`CURRENT_TIMESTAMP`	`now()`	`GETDATE()`	`SYSTIMESTAMP`
String Concatenation	`CONCAT(str1, str2)`	`str1		str2`
Date/Time Functions	Vary, but generally similar (e.g., `DATE_ADD`, `DATE_SUB`)	More comprehensive set of functions	Often similar to MySQL, but check documentation	Very extensive set of built-in functions
String Length	`LENGTH(str)`	`LENGTH(str)`	`LEN(str)`	`LENGTH(str)`
Parameterized Queries	Varies by client library	Varies by client library	Varies by client library	Varies by client library
`LIMIT`	`LIMIT 10 OFFSET 20`	`LIMIT 10 OFFSET 20`	`TOP 10` (with `ORDER BY` for offset)	`ROWNUM <= 30` (with subquery for offset)

Key Takeaways:

Normalization is a process, not a destination. You might not always need to achieve the highest normal form.
Understand the trade-offs between normalization and performance.
Prioritize data integrity and security.
Test your database design thoroughly.
Continuously monitor and optimize your database.
Document your database schema.
Stay up-to-date with the latest database technologies and best practices.