TL;DR
DuckDB is an embeddable OLAP database that combines a PAX columnar storage layout, a vectorized execution engine, and Morsel-Driven parallelism to run analytical queries quickly inside the host process.
Three-Layer Architecture
┌─────────────────────────────────────┐
│      Morsel-Driven Parallelism      │
│  ┌─────────┐ ┌─────────┐            │
│  │ Thread 1│ │ Thread 2│ ...        │
│  └────┬────┘ └────┬────┘            │
│       │           │                 │
│  ┌────▼───────────▼────┐            │
│  │ Vectorized Executor │            │
│  │ (Vector Size = 2048)│            │
│  └──────────┬──────────┘            │
│             │                       │
│  ┌──────────▼──────────┐            │
│  │ PAX Columnar Layout │            │
│  └─────────────────────┘            │
└─────────────────────────────────────┘
PAX (Partition Attributes Across) Layout
DuckDB does not use pure columnar storage (DSM). Instead, it uses PAX: a table is split horizontally into Row Groups, and within each Row Group the values of each column are stored contiguously:
Row Group (122,880 rows)
┌──────────────────────────────────────┐
│ Column Chunk 1 (trip_id):            │
│   [val1, val2, ..., val122880]       │
├──────────────────────────────────────┤
│ Column Chunk 2 (timestamp):          │
│   [ts1, ts2, ..., ts122880]          │
├──────────────────────────────────────┤
│ Column Chunk 3 (amount):             │
│   [amt1, amt2, ..., amt122880]       │
└──────────────────────────────────────┘
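To make the layout concrete, here is a minimal Python sketch of the idea, not DuckDB's actual storage code. The function name `to_pax` and the in-memory dict-of-lists representation are assumptions made for illustration; only the row-group size comes from the text above.

```python
from typing import Any

ROW_GROUP_SIZE = 122_880  # row-group size quoted in the text above


def to_pax(rows: list[dict[str, Any]],
           row_group_size: int = ROW_GROUP_SIZE) -> list[dict[str, list[Any]]]:
    """Split rows horizontally into row groups, then store each group column-wise.

    Hypothetical helper for illustration only.
    """
    if not rows:
        return []
    columns = list(rows[0].keys())
    row_groups = []
    for start in range(0, len(rows), row_group_size):
        chunk = rows[start:start + row_group_size]
        # One contiguous list (a "column chunk") per column inside this row group.
        row_groups.append({col: [row[col] for row in chunk] for col in columns})
    return row_groups


# Tiny demo with a row-group size of 2 so the structure is easy to inspect.
rows = [
    {"trip_id": 1, "timestamp": "t1", "amount": 9.5},
    {"trip_id": 2, "timestamp": "t2", "amount": 3.0},
    {"trip_id": 3, "timestamp": "t3", "amount": 7.25},
]
groups = to_pax(rows, row_group_size=2)
print(groups[0]["amount"])  # [9.5, 3.0]
```

Three rows with a group size of 2 yield two row groups; a scan that only needs `amount` reads one list per group and never touches `trip_id` or `timestamp`.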
Why PAX over Pure Columnar?
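- DSM (Decomposed Storage Model): each column is stored separately (for example, one file per column) → a query touching several columns needs a separate IO per column, and rows must be stitched back together
- NSM (N-ary Storage Model): row-wise → a scan drags in every column, including the ones the query never uses
- PAX: the middle ground. Rows are grouped into Row Groups and each column is contiguous within a group → a scan reads only the column chunks it needs, while all columns of a given row stay close together in the same Row Group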
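The trade-off in the list above can be sketched with a deliberately crude cost model. Everything here is an assumption for illustration: the function `cells_and_regions` is hypothetical, "region" stands in for a file or a row group, and real systems add compression, caching, and prefetching that this ignores.

```python
def cells_and_regions(layout: str, n_rows: int, n_cols: int,
                      cols_needed: int) -> tuple[int, int]:
    """Return (cells read, separate storage regions touched) for one row group.

    Toy model: NSM must read whole rows; DSM reads one file per needed column;
    PAX reads only the needed column chunks, all inside one row group.
    """
    if layout == "NSM":
        return n_rows * n_cols, 1
    if layout == "DSM":
        return n_rows * cols_needed, cols_needed
    if layout == "PAX":
        return n_rows * cols_needed, 1
    raise ValueError(f"unknown layout: {layout}")


# A query that needs 2 of 3 columns over one row group's worth of rows.
for layout in ("NSM", "DSM", "PAX"):
    cells, regions = cells_and_regions(layout, n_rows=122_880, n_cols=3, cols_needed=2)
    print(f"{layout}: {cells} cells read, {regions} region(s) touched")
```

Under these assumptions NSM reads 368,640 cells, while DSM and PAX both read 245,760; the difference between the latter two is that DSM touches two separate regions and PAX touches one.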
Vectorized Execution
DuckDB processes data in vectors of 2048 rows by default (STANDARD_VECTOR_SIZE), rather than row by row:
Volcano (row-at-a-time):
    for row in table:
        col_a = row.a                            ← function call × N
        col_b = row.b
        result = col_a + col_b
Vectorized (DuckDB):
    vector_a = table.a[0:2048]                   ← single function call
    vector_b = table.b[0:2048]
    result = vector_add(vector_a, vector_b)      ← SIMD accelerated
Benefits:
- Fewer virtual function calls (1 per 2048 rows vs 1 per row)
- CPU cache-friendly (batch data stays in L1/L2)
- SIMD-friendly (SSE/AVX): tight loops over contiguous vectors are easy for compilers to auto-vectorize for bulk computation
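The first benefit, fewer operator invocations, can be checked with a small Python sketch. This is an illustration of the batching pattern, not DuckDB code: `CallCounter`, `add_row_at_a_time`, and `add_vectorized` are hypothetical names, and pure Python gets none of the cache or SIMD benefits a compiled engine would.

```python
VECTOR_SIZE = 2048  # matches STANDARD_VECTOR_SIZE as quoted above


class CallCounter:
    """Counts how many times an operator is invoked."""

    def __init__(self) -> None:
        self.calls = 0


def add_row_at_a_time(a, b, counter):
    out = []
    for x, y in zip(a, b):
        counter.calls += 1  # one operator invocation per row
        out.append(x + y)
    return out


def vector_add(va, vb):
    # Tight loop over one batch; in a compiled engine this is where SIMD applies.
    return [x + y for x, y in zip(va, vb)]


def add_vectorized(a, b, counter, vector_size=VECTOR_SIZE):
    out = []
    for start in range(0, len(a), vector_size):
        counter.calls += 1  # one operator invocation per batch
        end = start + vector_size
        out.extend(vector_add(a[start:end], b[start:end]))
    return out


n = 10_000
a, b = list(range(n)), list(range(n, 2 * n))
row_counter, vec_counter = CallCounter(), CallCounter()
assert add_row_at_a_time(a, b, row_counter) == add_vectorized(a, b, vec_counter)
print(row_counter.calls, vec_counter.calls)  # 10000 5
```

Both paths produce the same result, but the batched version invokes the operator once per 2048-row vector, so 10,000 rows need 5 invocations instead of 10,000.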
Morsel-Driven Parallelism
// Illustrative pseudocode; the function names are not DuckDB's actual API.
// Data is partitioned into Morsels at execution time.
// Each Morsel is a batch of rows (in DuckDB, roughly one Row Group), handed out to threads dynamically.
while (true) {
    auto morsel = pipeline->GetNextMorsel();
    if (!morsel) break;          // no more data
    ExecuteOperator(morsel);     // run on the current thread
}
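Unlike partition-based parallelism, where each thread gets a fixed region up front, Morsel-Driven scheduling lets every thread pull the next Morsel as soon as it finishes the previous one. A thread that lands on an expensive Morsel simply takes fewer of them, so data skew is absorbed without any explicit rebalancing.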
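The pull-based loop can be sketched in Python with a shared queue. This is a rough model of the scheduling pattern only: `run_morsel_driven` and `MORSEL_SIZE` are hypothetical, the morsel size is kept small so the demo produces several morsels, and CPython's GIL means the threads here do not actually run the summation in parallel.

```python
import queue
import threading

MORSEL_SIZE = 1000  # hypothetical; small so the demo produces many morsels


def run_morsel_driven(data, n_threads=4, morsel_size=MORSEL_SIZE):
    # Split the input into (start, end) ranges and place them in a shared queue.
    morsels = queue.Queue()
    for start in range(0, len(data), morsel_size):
        morsels.put((start, min(start + morsel_size, len(data))))

    # Each thread writes only to its own slot, so no extra locking is needed.
    partial_sums = [0] * n_threads
    morsels_done = [0] * n_threads

    def worker(tid):
        while True:
            try:
                start, end = morsels.get_nowait()  # pull the next morsel
            except queue.Empty:
                return  # no more data
            partial_sums[tid] += sum(data[start:end])
            morsels_done[tid] += 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial_sums), morsels_done


data = list(range(10_000))
total, per_thread = run_morsel_driven(data)
print(total, sum(per_thread))  # 49995000 10
```

How the 10 morsels are split across the 4 threads varies from run to run, which is the point: the assignment is decided at execution time by whichever thread is free, not fixed in advance.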
Performance Characteristics
| Workload | DuckDB Advantage |
|---|---|
| Full table scan | Columnar scan, skip unused columns |
| Aggregation | SIMD-accelerated, vectorized batch aggregation |
| Join | Morsel-Parallel Hash Join |
| In-process analytics | Zero config, embedded deployment |
Summary
By combining PAX columnar storage, vectorized execution, and Morsel-Driven parallelism, DuckDB delivers analytical performance approaching that of large standalone OLAP systems in an embedded footprint. For data analysis and ETL, it plays the role in OLAP that SQLite plays in OLTP.