TL;DR

PostgreSQL uses Clock-sweep (not LRU) to manage page replacement in shared_buffers. This article walks through the core implementation in src/backend/storage/buffer/bufmgr.c.

Why Not LRU?

Classic LRU requires a global ordered list, which causes heavy lock contention under concurrency. PostgreSQL’s Clock-sweep uses a circular buffer with a clock hand that decrements usage_count on each pass. The first page reaching usage_count == 0 becomes the victim.

Core Data Structure

// src/include/storage/buf_internals.h

typedef struct BufferDesc {
    BufferTag   tag;           // page identifier
    pg_atomic_uint32 state;    // status bits
    int         buf_id;        // buffer index
    int         freeNext;      // freelist pointer
    LWLock      content_lock;  // per-buffer lock
} BufferDesc;

usage_count is embedded in the atomic state field, capped at BM_MAX_USAGE_COUNT (5).

Clock-sweep in Action

// src/backend/storage/buffer/bufmgr.c

static BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state) {
    for (;;) {
        // 1. Advance clock hand
        StrategyControl->nextVictimBuffer++;
        if (StrategyControl->nextVictimBuffer >= NBuffers)
            StrategyControl->nextVictimBuffer = 0;

        buf = GetBufferDescriptor(StrategyControl->nextVictimBuffer);

        // 2. Decrement usage_count
        local_buf_state = LockBufHdr(buf);
        usage_count = BUF_STATE_GET_USAGECOUNT(local_buf_state);

        if (usage_count > 0) {
            local_buf_state -= BUF_USAGECOUNT_ONE;  // usage_count--
            // continue scanning
        } else {
            // usage_count == 0 → selected as victim
            *buf_state = local_buf_state;
            return buf;
        }
    }
}

Key Insights

  1. Clock hand wraps around: Each lap decrements usage_count; after BM_MAX_USAGE_COUNT laps, any page will eventually reach zero
  2. NUMA-friendly: No global ordered structure; locking at individual buffer granularity
  3. Pin-count decoupled: Pinned pages are never evicted; usage_count increments happen at pin time

Access Pattern Behavior

Sequential scan:
         ↓ (each page touched once)
usage_count: 1 → clock pass → 0 → evicted next lap ✓

Hot page (index root node):
         ↓ (frequent access)
usage_count: 1 → clock pass → 0 → re-pinned → 1
         → stays in cache ✓

This mirrors LRU-K: one-hit-wonder pages are evicted quickly, frequently accessed pages persist.

Summary

PostgreSQL’s Clock-sweep is a pragmatic concurrent optimization of LRU. The core idea: replace ordered lists with circular scanning, and replace exact ranking with multi-pass decrement.

References