Unit 4: Database Design & Query Processing

Relational Database Design

Normalization is a process used in database design to:

Minimize data redundancy.
Avoid update anomalies such as insertion, deletion, and modification anomalies.
Organize data efficiently, ensuring that data is stored only once, reducing duplication.

The goal is to split data into smaller tables that are logically related to reduce redundancy and ensure data integrity.

Data Redundancy: Occurs when the same piece of data is stored in multiple places. This can lead to inconsistencies when the data is updated.
Update Anomalies:
- Insertion Anomaly: Inability to add data to the database due to missing other required data.
- Deletion Anomaly: Unintended loss of data when trying to remove other data.
- Modification Anomaly: Inconsistencies that arise when data is updated in multiple places.

A functional dependency (FD) occurs when one attribute uniquely determines another attribute in a relation:

Denoted as A → B, meaning attribute A determines attribute B.
Helps identify redundancy and dependencies that can be eliminated through normalization.

A relation is in 1NF if it contains only atomic (indivisible) values, meaning each column contains unique values and there are no repeating groups or arrays.
Example: A table with repeating fields (like multiple phone numbers in one column) would not be in 1NF.

A relation is in 2NF if:
- It is in 1NF.
- Every non-prime attribute is fully dependent on the whole primary key.
Partial dependencies (where an attribute is dependent on a part of a composite key) must be eliminated.

A relation is in 3NF if:
- It is in 2NF.
- There are no transitive dependencies (i.e., non-key attributes should not depend on other non-key attributes).
Example: If a table has a non-key attribute A that determines another non-key attribute B, this violates 3NF and should be separated into another table.

A stricter version of 3NF.
A relation is in BCNF if:
- For every functional dependency A → B, A must be a superkey.
BCNF addresses certain situations where 3NF does not completely eliminate redundancy.

Query processing involves:

Translating high-level SQL queries into low-level operations that the database can execute.
Optimizing the query for efficient retrieval and manipulation of data.

The steps in query processing include:

The cost of a query is typically measured in terms of:

The goal is to minimize these costs to ensure efficient query execution.

Selection Operation: Filters rows based on a specified condition (e.g., SELECT * FROM Employees WHERE Age > 30;).
Join Operation: Combines rows from two or more tables based on a related column (e.g., JOIN, NATURAL JOIN, INNER JOIN, LEFT JOIN).

These operations are fundamental to retrieving and combining data from different tables efficiently.

Relational Expressions: SQL queries are internally translated into relational algebra expressions.
The evaluation of these expressions involves choosing the best algorithms for selection, join, and other operations (e.g., hash joins, merge joins).

Query optimizers estimate the cost of different query execution plans using:

Statistical data: Such as the number of rows in a table, the number of distinct values in a column, and the availability of indexes.
Cost models: Based on the number of disk accesses, memory usage, and CPU time for different query operations.

Query optimization involves transforming relational expressions into equivalent but more efficient forms.
Techniques used include:
- Reordering operations (e.g., pushing selections down the query tree).
- Using indexes to speed up access.
- Replacing subqueries with more efficient joins or set operations.

Optimizers often consider different ways to execute a query and choose the plan that minimizes the estimated cost.