Why SQL and Relational Theory Exist: A Revolution in Data Independence

Introduction: A Deceptively Simple Question

Ask any software engineer “What is SQL?”, and they will likely produce SELECT * FROM users WHERE age > 18 without hesitation. But if you follow up with “Why was SQL created? What problems did it solve? What did databases look like before it?”, most will draw a blank.

Understanding the “why” of a technology is far more important than knowing the “how.” This article attempts to answer that fundamental question: What was the historical motivation behind relational database theory and SQL, and what core problems did they actually solve?

The Pre-Relational Era: Chaos in Data Management

The State of Databases in the 1960s

Before Codd published his seminal 1970 paper “A Relational Model of Data for Large Shared Data Banks,” data management was in a fairly “primitive” state.

Database systems at the time fell into two major camps:

1. The Hierarchical Model — IBM IMS

The hierarchical model organized data as tree structures. Each record had exactly one parent, and records were linked via pointers (physical addresses).

Department (D01, "Engineering")
  ├── Employee (E001, "Alice", 80000)
  │     ├── Child (C001, "Bob", 10)
  │     └── Child (C002, "Carol", 7)
  └── Employee (E002, "Dave", 75000)
        └── Child (C003, "Eve", 5)

Seemingly intuitive, but deeply flawed:

Asymmetric query traps: Navigating from Department to Employee was easy (follow the tree down), but going from Employee to Department was difficult (requiring reverse traversal or auxiliary indexes)
Data redundancy: If an employee belonged to multiple departments, the employee record had to be duplicated under each tree
Physical dependency: Applications had to know the physical storage paths — pointers, disk block addresses, and so on

2. The Network Model — CODASYL DBTG

The network model attempted to remedy the hierarchical model’s limitations by allowing many-to-many relationships between arbitrary nodes. Data was connected via “sets,” each with an owner record and multiple member records.

           Supplier (S1, "Acme")
           /        \
          /          \
 (S1-P1 supply)   (S1-P2 supply)
        /                \
  Part (P1, "Bolt")   Part (P2, "Nut")
        \                /
         \              /
     (P1-S2 supply)  (P2-S2 supply)
                \    /
           Supplier (S2, "BestCo")

While more flexible than the hierarchical model, the network model had fatal flaws:

Navigational access: Programmers had to traverse data step-by-step through pointers like rats in a maze, using commands like FIND NEXT IN SET
Brittle programs: Any structural change (such as adding a new set type) would break every application program that traversed that structure
No declarative queries: Finding “all suppliers that supply bolts” required dozens of lines of navigational code, not a single declarative statement

The Fundamental Tension: Programs Tightly Coupled to Data

Looking back at the pre-relational era, the problem can be distilled into a single core tension: Applications were deeply coupled to the physical storage structure of the data.

Codd observed sharply in his 1970 paper:

“Future users of large data banks must be protected from having to know how the data is organized in the machine.”

The reality at the time was the exact opposite:

Change storage structure → must rewrite applications
Add a new index → must modify query logic
Reorganize data → everything breaks

This was the fundamental problem Codd set out to solve: data independence.

Codd’s Revolution: The Relational Model

The Core Insight: Replace Pointers with Mathematics

Codd’s genius lay in realizing that a simple mathematical abstraction — the relation — could completely replace the complex web of pointers.

A relation is an n-ary relation from set theory, formally defined as:

Given sets $D_1, D_2, \ldots, D_n$ (called domains), a relation $R$ is a subset of the Cartesian product:

$R \subseteq D_1 \times D_2 \times \cdots \times D_n$

In database terms:

Each $D_i$ is a column type (e.g., INTEGER, VARCHAR)
$R$ is a table
Each element $(d_1, d_2, \ldots, d_n)$ in the Cartesian product is a row (or tuple)

This seemingly simple definition had revolutionary consequences.

The Five Pillars of the Relational Model

1. Data Independence

This was Codd’s central demand. In the relational model, users see tables at the logical level, not pointers at the physical level. You can freely change the underlying indexes, storage layout, and file organization without modifying any application code.

An analogy: the relational model is to the hierarchical/network model what high-level languages are to assembly. You express what you want with SELECT, rather than specifying how to get it with pointer traversal.

2. The Closure Property

Every operator in relational algebra takes relations as input and produces a relation as output. This means relational algebra expressions can be arbitrarily nested:

$\pi_{name}(\sigma_{age > 18}(\text{Employee} \bowtie \text{Department}))$

The closure property is the theoretical foundation for declarative query languages — because the result of every subexpression is still a relation, you can compose arbitrarily complex queries like building blocks.

3. Data Integrity

The relational model provides two built-in consistency mechanisms:

Entity integrity: Every relation must have a primary key; primary keys cannot be NULL
Referential integrity: Foreign keys must reference existing primary key values

In the pre-relational era, such constraints were entirely enforced by application code, making errors inevitable. The relational model elevated these constraints to the database system level, providing global guarantees.

4. Eliminating Redundancy — Normalization

Codd proposed the first, second, and third normal forms (later extended by Boyce and others to BCNF). The core idea:

Decompose data into normalized structures following “one fact, one place,” avoiding insertion anomalies, deletion anomalies, and update anomalies.

For example, taking a student table with embedded course information:

student_id	name	course_id	course_name	instructor
001	Alice	CS101	Database	Dr. Smith
001	Alice	CS202	Algorithms	Dr. Jones
002	Bob	CS101	Database	Dr. Smith

Decomposed into three independent normalized relations:

Students: (student_id, name)
Courses: (course_id, course_name, instructor)
Enrollments: (student_id, course_id)

Now changing a course name requires updating just one row, with no risk of inconsistency.

5. Declarative Querying — Relational Calculus

Codd proposed both relational algebra and relational calculus. Relational calculus is based on first-order predicate logic, allowing users to declaratively describe the data they want:

$\{t \mid t \in \text{Employee} \land t.\text{age} > 18\}$

This is equivalent to the SQL SELECT * FROM Employee WHERE age > 18. The user declares what they want; the system decides how to get it.

Codd proved the equivalence of relational algebra and relational calculus, establishing the theoretical foundation for query optimizers — which can rewrite declarative queries into equivalent, more efficient relational algebra expressions.

The Birth of SQL: Theory Standardized into Engineering Practice

System R and SEQUEL

In 1974, four years after Codd’s paper, Don Chamberlin and Ray Boyce at IBM’s San Jose Research Laboratory published “SEQUEL: A Structured English Query Language.” SEQUEL was later renamed SQL due to trademark issues.

Their stated design goal was clear:

“To provide a way for users without a background in mathematics or computer programming to access a relational database.”

The System R project was IBM’s first relational database prototype. It validated Codd’s theory in engineering practice:

A cost-based query optimizer — proving declarative queries could be executed efficiently
Transaction management (ACID) — proving the feasibility of concurrent multi-user access
Views — delivering on the promise of logical data independence

The ACID properties deserve special attention because they are not part of the original relational model but emerged from System R’s engineering work:

Atomicity: A transaction is all-or-nothing
Consistency: Every transaction moves the database from one consistent state to another
Isolation: Concurrent transactions do not interfere with each other
Durability: Once committed, a transaction’s effects survive system crashes

ACID transformed the relational database from a theoretical model into a practical system of record for the enterprise.

Why “Structured English” Instead of Pure Mathematical Notation?

This was a critical design decision. Relational calculus, while elegant, made expressions like $\{t \mid \exists u \in R(t.a = u.b \land u.c > 100)\}$ far too difficult for ordinary users.

SQL chose a natural-language-like syntax:

SELECT t.*
FROM R AS t, R AS u
WHERE t.a = u.b AND u.c > 100;

This choice dramatically lowered the barrier to entry and was key to SQL’s widespread adoption. But it also planted the seeds of SQL’s deviations from the pure relational model (allowing duplicate rows, three-valued logic, etc.), which C. J. Date has exhaustively criticized in SQL and Relational Theory.

Why It Remains Irreplaceable

1. Data Independence Is Still a Hard Requirement

Whether migrating from MySQL to TiDB, or moving from a single node to a distributed system — as long as the interface remains SQL, application code barely needs to change. The value of this abstraction layer cannot be overstated.

2. Declarative Queries Are the Endgame Design

Imperative database access (early CODASYL navigational queries, certain modern NoSQL APIs) might be more flexible in simple scenarios, but once query logic becomes complex, the declarative advantage becomes infinitely greater.

Consider this contrast:

Navigational (pseudo-code):

for each department:
    for each employee in department:
        if employee.salary > avg(department.salary):
            print employee.name

Declarative (SQL):

SELECT e.name
FROM employee e
JOIN (
    SELECT dept_id, AVG(salary) AS avg_sal
    FROM employee
    GROUP BY dept_id
) d ON e.dept_id = d.dept_id
WHERE e.salary > d.avg_sal;

When requirements change (adding partition filters, multi-table joins), the declarative change is far smaller than the imperative one.

3. The Value of Optimizer Freedom

Declarative queries leave enormous optimization headroom for the query optimizer. Over the past 50 years, the database community has probably invested more research in query optimization than in any other single database sub-field:

From System R’s cost-based optimizer
To the Volcano model and vectorized execution
To massively parallel execution in modern analytical systems
To machine-learning-based query optimization

The beneficiary of all this optimization is the original SELECT ... WHERE .... That is the power of abstraction.

4. Transactions and Consistency Guarantees

SQL databases tightly integrated transaction concepts with the relational model. This made relational databases the default choice for any system of record — financial systems, ERP, e-commerce order processing, any scenario where data loss is unacceptable.

Boundaries and Reflections on the Relational Model

Acknowledging the relational model’s greatness requires recognizing its limitations as well:

1. Data Model Uniformity Not all data fits naturally in tables. Graph-structured data (social networks), document data (JSON), and time-series data (sensor readings) are either unnatural or poor-performing when modeled relationally.

2. Schema Rigidity Strong typing and predefined schemas are the source of the relational model’s data integrity guarantees, but they also mean a lack of flexibility. This is the reason document databases like MongoDB exist — in schema-uncertain scenarios, the document model is more convenient.

3. Distributed Scaling Challenges The relational model’s ACID semantics are extremely difficult to implement in distributed environments (as articulated by the CAP theorem). This spawned both the NewSQL movement (CockroachDB, TiDB) and the NoSQL movement.

But here is the interesting part: most NoSQL databases eventually “grew” SQL interfaces (Cassandra’s CQL, MongoDB’s aggregation pipeline, even Redis’s SQL-like queries). This is not a coincidence — declarative data access is the Pareto-optimal solution for both expressive clarity and system optimization headroom.

Conclusion: The Legacy of an 11-Page Paper

In 1970, Edgar F. Codd published an 11-page paper. It proposed no new algorithms, invented no new data structures, and didn’t even include performance comparisons. It did just one thing: it redefined how humans interact with data, using a clean and elegant mathematical framework.

Over 50 years later, we are still:

Querying with SELECT — declarative thinking inherited from relational calculus
Joining with JOIN — the natural join from relational algebra
Using primary keys and foreign keys — entity integrity and referential integrity
Designing table structures with normal forms — normalization theory

Every time you write a line of SQL, you are paying unconscious tribute to a great theory.

References

Codd, E. F. (1970). “A Relational Model of Data for Large Shared Data Banks.” Communications of the ACM, 13(6), 377–387.
Chamberlin, D. D., & Boyce, R. F. (1974). “SEQUEL: A Structured English Query Language.” Proceedings of the 1974 ACM SIGFIDET Workshop.
Date, C. J. (2009). SQL and Relational Theory: How to Write Accurate SQL Code. O’Reilly Media.
Astrahan, M. M., et al. (1976). “System R: Relational Approach to Database Management.” ACM Transactions on Database Systems, 1(2), 97–137.
Stonebraker, M., & Hellerstein, J. M. (2005). Readings in Database Systems (4th ed.). MIT Press.
Codd, E. F. (1972). “Further Normalization of the Data Base Relational Model.” Data Base Systems, Courant Computer Science Symposia Series, Vol. 6.