Inside DuckDB: Columnar Storage and Vectorized Execution

TL;DR

DuckDB is an embeddable OLAP database. It combines a PAX columnar storage layout with a vectorized execution engine, and uses Morsel-Driven parallelism to spread analytical queries across CPU cores.

Three-Layer Architecture

┌─────────────────────────────────────┐
│      Morsel-Driven Parallelism      │
│  ┌─────────┐ ┌─────────┐            │
│  │ Thread 1│ │ Thread 2│ ...        │
│  └────┬────┘ └────┬────┘            │
│       │           │                 │
│  ┌────▼───────────▼────┐            │
│  │ Vectorized Executor │            │
│  │ (Vector Size = 2048)│            │
│  └──────────┬──────────┘            │
│             │                       │
│  ┌──────────▼──────────┐            │
│  │ PAX Columnar Layout │            │
│  └─────────────────────┘            │
└─────────────────────────────────────┘

PAX (Partition Attributes Across) Layout

DuckDB does not use pure columnar storage (DSM). Instead, it uses PAX: the table is split horizontally into Row Groups, and within each Row Group the values are stored column by column:

Row Group (122,880 rows)
┌──────────────────────────────────────┐
│ Column Chunk 1 (trip_id):            │
│ [val1, val2, ..., val122880]         │
├──────────────────────────────────────┤
│ Column Chunk 2 (timestamp):          │
│ [ts1, ts2, ..., ts122880]            │
├──────────────────────────────────────┤
│ Column Chunk 3 (amount):             │
│ [amt1, amt2, ..., amt122880]         │
└──────────────────────────────────────┘
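
The sizes quoted in this article imply some simple address arithmetic. As a minimal sketch (the `locate` helper is hypothetical, not a DuckDB function, and it assumes full Row Groups of 122,880 rows and vectors of 2048 rows), a global row number maps to a Row Group, a vector within that Row Group's column chunk, and an offset inside the vector:

```python
ROW_GROUP_SIZE = 122_880   # rows per Row Group (figure quoted in this article)
VECTOR_SIZE = 2_048        # rows per vector (figure quoted in this article)


def locate(row: int) -> tuple[int, int, int]:
    """Map a global row number to (row_group, vector_in_group, offset_in_vector).

    Hypothetical helper for illustration only; real storage also has to
    deal with deletes, partially filled Row Groups, and compression.
    """
    row_group, row_in_group = divmod(row, ROW_GROUP_SIZE)
    vector_in_group, offset = divmod(row_in_group, VECTOR_SIZE)
    return row_group, vector_in_group, offset


vectors_per_group = ROW_GROUP_SIZE // VECTOR_SIZE
print(vectors_per_group)   # 60
print(locate(300_000))     # (2, 26, 992)
```

A Row Group therefore holds exactly 60 full vectors per column, which is why the two sizes fit together cleanly.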

Why PAX over Pure Columnar?

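DSM (Decomposed Storage Model): each column is stored separately, classically one file per column → a query touching several columns needs a separate read per column
NSM (N-ary Storage Model): rows are stored whole → a scan drags in columns the query never uses
PAX: the middle ground. Each Row Group keeps its columns in contiguous chunks, so a scan reads only the columns it needs, and all values belonging to a given row stay within the same Row Group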
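
To make the three layouts concrete, here is a small sketch of the order in which values would be laid out for a four-row, two-column table. The function names and the tiny row group size of 2 are assumptions chosen for readability, not DuckDB code:

```python
from typing import List, Tuple

Row = Tuple[str, ...]


def nsm(rows: List[Row]) -> List[str]:
    """Row-wise: all fields of row 0, then all fields of row 1, and so on."""
    return [value for row in rows for value in row]


def dsm(rows: List[Row]) -> List[List[str]]:
    """Fully decomposed: one separate sequence per column."""
    return [list(column) for column in zip(*rows)]


def pax(rows: List[Row], row_group_size: int) -> List[str]:
    """Split into row groups, then store each group column by column."""
    out: List[str] = []
    for start in range(0, len(rows), row_group_size):
        group = rows[start:start + row_group_size]
        for column in zip(*group):
            out.extend(column)
    return out


table = [("id1", "amt1"), ("id2", "amt2"), ("id3", "amt3"), ("id4", "amt4")]
print(nsm(table))      # ['id1', 'amt1', 'id2', 'amt2', 'id3', 'amt3', 'id4', 'amt4']
print(dsm(table))      # [['id1', 'id2', 'id3', 'id4'], ['amt1', 'amt2', 'amt3', 'amt4']]
print(pax(table, 2))   # ['id1', 'id2', 'amt1', 'amt2', 'id3', 'id4', 'amt3', 'amt4']
```

In the PAX output, each column's values are contiguous within a group, yet the values of rows 1 and 2 never end up far from each other.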

Vectorized Execution

DuckDB processes data in vectors of up to 2048 rows (STANDARD_VECTOR_SIZE, the default) rather than row by row:

Volcano (row-at-a-time):
  for row in table:                          # one operator call per row, N calls in total
    col_a = row.a
    col_b = row.b
    result = col_a + col_b

Vectorized (DuckDB):
  for start in range(0, N, 2048):            # one operator call per 2048-row batch
    vector_a = table.a[start:start + 2048]
    vector_b = table.b[start:start + 2048]
    result = vector_add(vector_a, vector_b)  # tight loop over arrays, SIMD-friendly

Benefits:

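Fewer virtual function calls: each operator is invoked once per 2048-row vector instead of once per row
CPU cache-friendly: a batch of a few thousand values is small enough to stay in L1/L2 while operators work on it
SIMD-friendly: operators are tight loops over arrays, which compilers can turn into SSE/AVX instructions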
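
The first benefit is easy to check with a toy model. The sketch below is not DuckDB code: `add_row`, `add_vector`, and `CallCounter` are hypothetical names, and pure Python gains none of the cache or SIMD effects. It only counts how often an operator is invoked under each model:

```python
from typing import List, Tuple

VECTOR_SIZE = 2048  # batch size quoted in this article


class CallCounter:
    """Counts how many times an operator function was invoked."""

    def __init__(self) -> None:
        self.calls = 0


def add_row(a: int, b: int, counter: CallCounter) -> int:
    """Row-at-a-time operator: invoked once per row."""
    counter.calls += 1
    return a + b


def add_vector(a: List[int], b: List[int], counter: CallCounter) -> List[int]:
    """Vectorized operator: invoked once per batch, loops internally."""
    counter.calls += 1
    return [x + y for x, y in zip(a, b)]


def run_row_at_a_time(col_a: List[int], col_b: List[int]) -> Tuple[List[int], int]:
    counter = CallCounter()
    result = [add_row(a, b, counter) for a, b in zip(col_a, col_b)]
    return result, counter.calls


def run_vectorized(col_a: List[int], col_b: List[int]) -> Tuple[List[int], int]:
    counter = CallCounter()
    result: List[int] = []
    for start in range(0, len(col_a), VECTOR_SIZE):
        end = start + VECTOR_SIZE
        result.extend(add_vector(col_a[start:end], col_b[start:end], counter))
    return result, counter.calls


n = 10_000
col_a = list(range(n))
col_b = list(range(n, 2 * n))
rows_result, row_calls = run_row_at_a_time(col_a, col_b)
vec_result, vec_calls = run_vectorized(col_a, col_b)
print(rows_result == vec_result)  # True
print(row_calls, vec_calls)       # 10000 5
```

Both models produce the same result, but 10,000 rows cost 10,000 operator calls in one case and 5 in the other (four full vectors plus one partial vector).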

Morsel-Driven Parallelism

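// Simplified worker loop; the function names are illustrative, not DuckDB's actual API.
// Data is split into Morsels at execution time, not at planning time.
// Each Morsel is a chunk of rows (for table scans, DuckDB hands out work per Row Group),
// assigned dynamically to whichever thread is idle.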

while (true) {
    auto morsel = pipeline->GetNextMorsel();
    if (!morsel) break;  // no more data
    ExecuteOperator(morsel);  // run on current thread
}

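Unlike partition-based parallelism, where each thread gets a fixed region up front, Morsel-Driven scheduling naturally handles data skew: a thread that finishes its Morsel early simply pulls the next one, so no thread sits idle while another works through an expensive region.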
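
The effect of skew can be illustrated with a small deterministic simulation. This is a sketch under stated assumptions rather than DuckDB's scheduler: it uses no real threads, the per-morsel costs are made-up time units, and `static_makespan` and `morsel_makespan` are hypothetical helper names.

```python
import heapq
from typing import List


def static_makespan(costs: List[int], threads: int) -> int:
    """Each thread gets one fixed contiguous region; the slowest region decides."""
    region = -(-len(costs) // threads)  # ceiling division
    return max(sum(costs[i:i + region]) for i in range(0, len(costs), region))


def morsel_makespan(costs: List[int], threads: int) -> int:
    """Whichever thread becomes free first pulls the next morsel."""
    free_at = [0] * threads          # min-heap of times at which each thread is idle
    for cost in costs:               # morsels are handed out in order
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)


# 8 morsels, 4 threads; the first two morsels are 10x as expensive (skew).
costs = [10, 10, 1, 1, 1, 1, 1, 1]
print(static_makespan(costs, 4))   # 20
print(morsel_makespan(costs, 4))   # 10
```

With fixed regions, one thread is stuck with both expensive morsels and the query takes 20 time units. With dynamic assignment, the two expensive morsels land on different threads and the remaining work is absorbed by the others, so the query finishes in 10.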

Performance Characteristics

Workload	DuckDB Advantage
Full table scan	Columnar scan, skip unused columns
Aggregation	SIMD-accelerated, vectorized batch aggregation
Join	Morsel-Parallel Hash Join
In-process analytics	Zero config, embedded deployment

Summary

DuckDB combines PAX columnar storage, vectorized execution, and Morsel-Driven parallelism to deliver performance approaching that of large OLAP systems in an embedded footprint. For data analysis and ETL, it plays the role for OLAP that SQLite plays for OLTP.