Reading Notes on SQL and Relational Theory

Chapter 1: Setting the Scene

Excerpts

The closure property in relational algebra is what allows us to write nested relational expressions, because the output of every operator is of the same type as its input — the output of one operator can serve as the input of another.

Because the model is separated from implementation, we gain physical data independence.

The number of attributes is called the “degree”, sometimes also referred to as “arity”. The number of tuples in the body is called the “cardinality”.

The tuples of a relation are completely unordered. This property holds because the body is defined as a set, and the elements of a mathematical set are unordered. Likewise, the attributes of a relation have no left-to-right order, because the heading is also a mathematical set.

A subset of a tuple is still a tuple. Every subset of a heading is still a heading. A heading is a set of n attributes.

The operators of relational algebra allow us to start with certain given relations and derive further relations from them. These given relations are called “base relations”, and the derived ones are called “derived relations”. In SQL, the counterpart of a base relation is a base table.

Notes & Reflections

A relation is a mathematically unordered set of tuples, where each tuple is composed of a set of attributes (columns):

Attributes have no order
Tuples have no order

Differences between SQL and the relational model:

Duplicate rows: SQL permits duplicate tuples, violating the set-based definition of the relational model
NULL values: SQL’s three-valued logic (true, false, unknown) undermines the relational model’s two-valued logic

In mathematics, a collection that allows duplicate elements is called a multiset. For example: the multiset {a, a, b, c}.

Chapter 2: Types and Domains

Excerpts

The relational model requires every type to support equality.

In essence, a type is a named, finite set of values — that is, all possible values of a particular kind.

Every type is associated with a set of operators for manipulating values and variables of that type. Relations have relational algebra operators, and every type has the assignment operator (:=) and the equality comparison operator (=). A type without operators is meaningless.

In the computing field, a widely accepted principle is that implicit type conversions should be avoided whenever possible, because they are error-prone. In SQL in particular, a bizarre consequence of allowing implicit type conversions is that certain union, intersection, and difference operations can produce rows that did not appear in any operand.

postgres=# create table ta(x integer, y numeric(5,1));
CREATE TABLE
postgres=# create table tb(x numeric(5,1), y integer);
CREATE TABLE

postgres=# insert into ta values (0,1.0),(0,2.0);
INSERT 0 2
postgres=# select * from ta;
 x |  y  
---+-----
 0 | 1.0
 0 | 2.0
(2 rows)

postgres=# insert into tb values (0.0,0),(0.0,1),(1.0,2);
INSERT 0 3
postgres=# select * from tb;
  x  | y 
-----+---
 0.0 | 0
 0.0 | 1
 1.0 | 2
(3 rows)

-- Results may vary across different database implementations
postgres=# select x,y from ta union select x,y from tb;
  x  |  y  
-----+-----
 0.0 |   0
   0 | 1.0
   0 | 2.0
 1.0 |   2
(4 rows)

For any given string: a) It is composed of an associated character set b) It has an associated collation. A collation is a set of rules associated with a particular character set, also known as a collating sequence. It determines how strings within that character set are compared.

Notes & Reflections

Scalar: non-decomposable, structureless, represents a single value, e.g., integers, floating-point numbers
Non-scalar: a data structure composed of multiple scalar values, possessing internal structure, decomposable into multiple scalars, e.g., arrays, sets

Chapter 3: Tuples, Relations, Rows, Tables

From a mathematical perspective, a set of tuples is a relation. The body of a relation is a set (a set of tuples).

Every subset of the body is still a body. Generally speaking, every subset of a relation is still a relation.

Chapter 4: No Duplicates, No Nulls

Strong recommendations:

Specify NOT NULL for every column of every base table.
Do not use outer joins.

Chapter 5: Base Relvars and Base Tables

A relation variable (relvar) is a variable whose permitted values are relations.

Candidate key definition: Let K be a subset of the heading of relvar R. Then K is a candidate key for R if and only if it possesses the following two properties:

Uniqueness: No two distinct tuples in R have the same value for K.
Irreducibility: No proper subset of K also possesses the uniqueness property.

Strong recommendation: In SQL, for every base table, always use a UNIQUE or PRIMARY KEY specification to ensure that each table has at least one key.

Most other models possess a certain degree of ad-hoc specificity, unlike the relational model, which is built upon set theory and predicate logic.

Chapter 6: SQL and Relational Algebra I: The Primitive Operators

Let r be a relation. Then:

The projection of r over all of its attributes returns r; such a projection is called the identity projection.

Joinability: Relations r1 and r2 are joinable if and only if attributes with the same name are of the same type (i.e., they are in fact the very same attribute), or equivalently, if and only if the result of taking the set-theoretic union of the headings of r1 and r2 is itself a legal heading.

A join followed by a restriction can always be transformed into a restriction followed by a join. In relational algebra, restriction distributes over intersection, union, and difference. In most cases, doing the restriction first is a good idea, as it generally:

Reduces the number of tuples that must be scanned in the next step of the operation sequence
Reduces the number of tuples output by the next operation

Chapter 7: SQL and Relational Algebra II: Additional Operators

In the relational model, an aggregate operator derives an “aggregate” of the values of some attribute from some relation.

Chapter 8: SQL and Constraints

An integrity constraint is essentially a boolean expression that must evaluate to TRUE. In essence, such constraints fall into two broad categories: type constraints and database constraints.

Type constraints define the values that constitute a given type.
Database constraints further restrict the values that may appear in a particular database.

A database should be a representation of some aspect of the enterprise, and this representation should be as faithful as possible, so as to ensure that correct decisions are made based on the contents of the database. Constraints are the best tool for ensuring that the database representation is as faithful as possible.

When defining a type, one thing we must do is specify the values that constitute that type — this is precisely what a type constraint does.

Transaction ACID:

Atomicity: A transaction is either fully performed or not performed at all.
Consistency: Every transaction takes the database from one consistent state to another consistent state, without necessarily preserving consistency at any intermediate point. Note that a database state is consistent if and only if it satisfies all defined constraints.
Isolation: The updates of any transaction are hidden from all other transactions until the transaction commits.
Durability: Once a transaction commits successfully, its updates persist in the database even if the system subsequently crashes.

Semantic optimization means using constraints to simplify queries, thereby improving performance.

Chapter 9: SQL and Views

Excerpts

Physical data independence: refers to our ability to change the way data is physically stored and accessed without changing the way users perceive the data. Logical data independence: refers to our ability to change the way data is logically stored and accessed without changing the way users perceive the data. And it is views that are supposed to provide logical data independence.

Snapshots are important in data warehousing, distributed systems, and many other contexts. In these contexts, applications can often tolerate (and sometimes even require) data to remain unchanged from a certain point in time onward. Reporting and accounting applications are prime examples. Such applications typically require data to be frozen at a given moment, and snapshots allow this freezing to occur without locking out other applications.

View operations should be implemented by mapping them into operations on the appropriate underlying relvars.

Chapter 10: SQL and Logic

What is relational calculus? In essence, it is an applied form of predicate calculus (also known as predicate logic), tailored specifically for relational databases.

Relational completeness is a fundamental measure of a language’s expressive power: if a language is relationally complete, it means that arbitrarily complex queries can be expressed without resorting to any iterative loops or branching.

Chapter 11: Using Logic to Formulate SQL Expressions

Restriction distributes over union, intersection, and difference. Projection distributes over union, but not over intersection or difference.

Chapter 12: Miscellaneous SQL Topics

An implementation-defined feature is one whose semantics can vary from one implementation to another, but whose characteristics must at least be specified in every implementation. In other words, an implementation may decide for itself how it implements the feature, but the result of that decision must be documented.

The use of SELECT * should be avoided, because the meaning of * changes when new columns are added to an existing table. It is recommended to explicitly specify the relevant columns by name.

Range variable: In the relational model, a range variable is a variable that ranges over the set of tuples in some relation (or, in SQL terms, the set of rows in some table).

A subquery in SQL is a table expression enclosed in parentheses. Subqueries fall into three broad categories:

Table subquery: a subquery that is neither a row subquery nor a scalar subquery.
Row subquery: a subquery that appears in a position where a row expression would be expected.
Scalar subquery: a subquery that appears in a position where a scalar expression would be expected.

A correlated subquery is a special type of table, row, or scalar subquery. Specifically, it is a subquery that contains a reference to an “outer” table.

An empty set is a set that contains no elements.

Notes & Reflections

Core value of views: logical data independence, simplified queries, access control.

View updatability: Some views are updatable, others are not, depending on how the view is defined and on the underlying DBMS support.

From the perspective of relational database theory, for a view to support update operations, a series of strict conditions must be satisfied. The typical theoretical conditions for a view to be updatable include:

Based on a single base table: the view’s FROM clause contains only one base table (no joins involving multiple tables).
Includes a candidate key: the view’s query result must include the primary key (or another candidate key) of the underlying base table. This is necessary to uniquely identify the row to be modified.
No aggregation: GROUP BY, HAVING, or aggregate functions (such as SUM, COUNT, AVG, etc.) must not be used.
No set operations: UNION, UNION ALL, INTERSECT, EXCEPT, etc. must not be used.
No DISTINCT keyword: ensures that rows are unique and traceable.
No computed columns: if a view’s column is derived from an expression or function, it is generally not directly updatable. The update operation must be able to determine unambiguously how to modify the original column in the base table.

For views that are theoretically non-updatable, the commonly adopted approach across databases today is: the INSTEAD OF trigger. Principle: you can define a trigger on a view that tells the database, whenever someone attempts an INSERT, UPDATE, or DELETE operation on the view, to execute custom SQL logic instead of the default operation.

Appendix A: The Relational Model

A database must be relational.

In other words, common non-relational databases are merely application-specific data stores — one could even say they hardly qualify as databases at all.

In precise mathematical terms, a set of ordered pairs is a binary relation. The definition is as follows: A binary relation over two sets A and B is a subset of the Cartesian product of A and B. In other words, a binary relation is a set of ordered pairs (a, b) such that the first element a comes from A and the second element b comes from B.

The relational model consists of five components:

An extensible collection of scalar types, including in particular the BOOLEAN type.
A relation type generator and the intended interpretation of relations of the types so generated.
Facilities for defining relation types generated by the aforementioned generator.
A relation assignment operator for assigning relation values to relation variables.
A relationally complete, general-purpose set of relational operators for deriving relation values.

The following is the scientific method:

Empirically observe a particular phenomenon
Construct a theory or hypothesis to explain the phenomenon
Use the theory to make predictions
Test the accuracy of the predictions
Refine the theory based on the test results
Iterate as above

Appendix B: SQL Departures from the Relational Model

SQL often does not treat views as tables.
SQL’s support for view updating is very weak, ad-hoc, and incomplete.
SQL is based on three-valued logic, whereas the relational model is based on two-valued logic.
…

Appendix C: A Relational Approach to Missing Information

Notes & Reflections

The primary method for handling missing information in the SQL standard is NULL values. Advantages: simple and intuitive, storage-efficient (no additional storage overhead), widely supported. Disadvantages: the complexity of three-valued logic (true, false, unknown complicates query logic), semantic ambiguity (unable to distinguish between different meanings such as “unknown”, “inapplicable”, etc.).