TL;DR
PostgreSQL uses Clock-sweep (not LRU) to manage page replacement in shared_buffers. This article walks through the core implementation in src/backend/storage/buffer/bufmgr.c.
Why Not LRU?
Classic LRU requires a global ordered list, which causes heavy lock contention under concurrency. PostgreSQL’s Clock-sweep uses a circular buffer with a clock hand that decrements usage_count on each pass. The first page reaching usage_count == 0 becomes the victim.
Core Data Structure
// src/include/storage/buf_internals.h
typedef struct BufferDesc {
BufferTag tag; // page identifier
pg_atomic_uint32 state; // status bits
int buf_id; // buffer index
int freeNext; // freelist pointer
LWLock content_lock; // per-buffer lock
} BufferDesc;
usage_count is embedded in the atomic state field, capped at BM_MAX_USAGE_COUNT (5).
Clock-sweep in Action
// src/backend/storage/buffer/bufmgr.c
static BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state) {
for (;;) {
// 1. Advance clock hand
StrategyControl->nextVictimBuffer++;
if (StrategyControl->nextVictimBuffer >= NBuffers)
StrategyControl->nextVictimBuffer = 0;
buf = GetBufferDescriptor(StrategyControl->nextVictimBuffer);
// 2. Decrement usage_count
local_buf_state = LockBufHdr(buf);
usage_count = BUF_STATE_GET_USAGECOUNT(local_buf_state);
if (usage_count > 0) {
local_buf_state -= BUF_USAGECOUNT_ONE; // usage_count--
// continue scanning
} else {
// usage_count == 0 → selected as victim
*buf_state = local_buf_state;
return buf;
}
}
}
Key Insights
- Clock hand wraps around: Each lap decrements
usage_count; afterBM_MAX_USAGE_COUNTlaps, any page will eventually reach zero - NUMA-friendly: No global ordered structure; locking at individual buffer granularity
- Pin-count decoupled: Pinned pages are never evicted;
usage_countincrements happen at pin time
Access Pattern Behavior
Sequential scan:
↓ (each page touched once)
usage_count: 1 → clock pass → 0 → evicted next lap ✓
Hot page (index root node):
↓ (frequent access)
usage_count: 1 → clock pass → 0 → re-pinned → 1
→ stays in cache ✓
This mirrors LRU-K: one-hit-wonder pages are evicted quickly, frequently accessed pages persist.
Summary
PostgreSQL’s Clock-sweep is a pragmatic concurrent optimization of LRU. The core idea: replace ordered lists with circular scanning, and replace exact ranking with multi-pass decrement.