PostgreSQL HA in Kubernetes

Streaming Replication, WAL, Patroni, CloudNativePG, Multi-Zone & Logical Replication

POSTGRESQL · KUBERNETES · PATRONI · CLOUDNATIVEPG · WAL · PGBOUNCER

1. PostgreSQL HA Deployment Modes

🖥️

Standalone

Single pod, single PVC. No HA. Dev/test only.

🔁

Streaming Replication

Primary → Standby via WAL stream. Manual failover without a manager.

📋

Logical Replication

Replicate specific tables/databases. Supports cross-version and partial sync.

🛡️

Patroni / CloudNativePG

Automated failover with DCS (etcd/consul) consensus. Production standard.

PostgreSQL vs Redis replication: PostgreSQL replication is WAL-based — every change is written to the Write-Ahead Log before it's applied. Standbys replay WAL segments. This means standby always has a consistent, crash-safe copy (unlike Redis async replication which can lose buffered commands).

2. WAL — Write-Ahead Log (The Core of PG Replication)

Client / App INSERT / UPDATE Shared Buffers Dirty pages (in RAM) ② write to WAL buffer WAL Buffer In-memory log ③ fsync to disk (before COMMIT ack) WAL Files on Disk pg_wal/0000001000000000000001 16MB segments, rotated ④ COMMIT ACK to client (after WAL fsync, NOT after heap write) Checkpointer Flushes dirty pages → data files ⑤ periodically WAL Sender Process on primary streams WAL to standbys ⑥ read & stream WAL Receiver Process on standby ⑦ TCP stream WAL Key Facts • WAL = append-only, crash-safe log • COMMIT only acks AFTER WAL fsync • Heap/data files written lazily by checkpointer (default every 5min) • 16MB segments: 000000010000000000000001 • wal_level = replica (for streaming) • wal_level = logical (for pub/sub)
Why WAL makes PG replication safer than Redis: PostgreSQL commits only after WAL is fsynced to disk. The standby replays the same WAL records. Even if the primary crashes mid-transaction, the standby has a consistent view. Redis acks writes before replication — a crash between ack and replication means data loss.
-- Key WAL configuration (postgresql.conf) wal_level = replica -- minimum for streaming replication max_wal_senders = 5 -- max concurrent WAL sender connections wal_keep_size = 512MB -- keep WAL available for slow replicas checkpoint_timeout = 5min -- flush dirty pages every 5min synchronous_commit = on -- default: wait for local WAL fsync -- for sync replication: wait for at least 1 standby synchronous_standby_names = 'FIRST 1 (standby1, standby2)'

3. Streaming Replication — WAL in Real Time

Async (default)

1

Primary writes WAL + fsyncs

Transaction commits after local WAL fsync. Standby receives WAL asynchronously.

2

ACK sent immediately to client

Client doesn't wait for standby to confirm. Highest throughput, small risk of data loss on failover.

Replication lag possible

Standby can be seconds behind under high write load. Monitor with pg_stat_replication.write_lag.

Synchronous

1

Primary writes WAL + waits

Transaction pauses until the designated sync standby confirms WAL written/flushed.

2

Standby confirms (remote_write / remote_apply)

remote_write = standby received in OS buffer. remote_apply = standby applied the changes.

3

Zero data loss on failover

Promoted standby has all committed transactions. Tradeoff: write latency includes network RTT to standby.

Synchronous replication footgun: If all sync standbys go down, writes hang indefinitely by default. Set synchronous_standby_names = 'ANY 1 (s1, s2)' so the primary falls back to async if all standbys disconnect.
?

What is a replication slot?

A persistent cursor on the primary. The primary keeps WAL segments until all consumers of a slot have consumed them. Survives primary restarts.

Why use slots?

Without a slot, a lagging standby might miss WAL that gets recycled. Slots guarantee the standby can always catch up even after downtime.

Danger: unbounded WAL accumulation

If a replica with a slot goes offline for a long time, the primary accumulates WAL forever → disk full → primary crashes. Always monitor pg_replication_slots.active and set max_slot_wal_keep_size.

-- Create a physical replication slot (streaming replication) SELECT * FROM pg_create_physical_replication_slot('standby1_slot'); -- Monitor slots SELECT slot_name, active, wal_status, safe_wal_size, restart_lsn FROM pg_replication_slots; -- Cap how much WAL a slot can hold back (PG 13+) max_slot_wal_keep_size = 2GB -- slot invalidated if it falls behind more than this
📖

Hot Standby

Read queries on standby

Set hot_standby = on (default). Standby serves read-only SELECT queries while replaying WAL.

Read-Your-Writes Problem

Async lag caveat

A client writes to primary, immediately reads from standby — may see stale data due to replication lag. Solve with recovery_min_apply_delay = 0 or sticky sessions.

⚖️

Load Balancing Reads

PgBouncer / HAProxy

Route reads to standbys via PgBouncer pool or HAProxy. K8s Service with app: pg-standby label. Spread read load across all replicas.

🔒

Standby Conflicts

Vacuum vs long queries

Vacuum on primary removes dead tuples that a standby read query still needs → conflict. Tune max_standby_streaming_delay = 30s before canceling standby queries.

-- On standby: check lag in bytes and seconds SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS receive_lag_bytes, now() - pg_last_xact_replay_timestamp() AS replay_lag_secs; -- On primary: check all standby states SELECT application_name, state, sync_state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;

4. Logical Replication — Table-Level, Cross-Version

📋

Publish / Subscribe Model

PG 10+

Publisher creates a publication for specific tables. Subscriber creates a subscription pointing to the publisher. Row-level changes (INSERT/UPDATE/DELETE) are decoded from WAL and sent.

-- On publisher (source): CREATE PUBLICATION my_pub FOR TABLE orders, products; -- On subscriber (target): CREATE SUBSCRIPTION my_sub CONNECTION 'host=primary port=5432 dbname=app user=repl' PUBLICATION my_pub; -- Monitor replication progress SELECT * FROM pg_stat_subscription;

When to Use Logical vs Streaming

L

Cross-version migration

Upgrade PG 14 → 17 with zero downtime. Logical replication works between major versions (streaming does not).

L

Partial dataset replication

Only replicate specific tables or even filtered rows to a reporting replica or separate microservice database.

L

Multi-master writes (with care)

Two clusters can be publisher and subscriber to each other — but you must handle conflict resolution manually.

S

Full HA failover → use Streaming

Streaming replication is simpler for full standby promotion. Logical is an add-on, not a replacement for streaming HA.

5. Multi-Zone Layout in Kubernetes

ZONE A — Primary
🔵 postgres-0 PRIMARY
🟣 pgbouncer-0 POOL
🟢 etcd-0 DCS
ZONE B — Sync Standby
🟠 postgres-1 SYNC STANDBY
🟣 pgbouncer-1 POOL
🟢 etcd-1 DCS
ZONE C — Async Standby
🟠 postgres-2 ASYNC STANDBY
🟣 pgbouncer-2 POOL
🟢 etcd-2 DCS
Why one sync + one async standby? Zone B (sync) guarantees zero data loss on failover — it always has every committed transaction. Zone C (async) is cheaper and used for read scaling + DR without adding write latency. If Zone A fails, Patroni promotes Zone B instantly.
# Spread PG pods across zones (CloudNativePG) topologySpreadConstraints: - maxSkew: 1 topologyKey: topology.kubernetes.io/zone whenUnsatisfiable: DoNotSchedule labelSelector: matchLabels: cnpg.io/cluster: pg-cluster # PodAntiAffinity: primary and sync standby # MUST be in different zones affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - topologyKey: topology.kubernetes.io/zone
# K8s Services needed for PG HA # 1. Read-Write service (primary only) name: postgres-rw selector: { role: primary } # Updated by operator on failover # 2. Read-Only service (all standbys) name: postgres-ro selector: { role: replica } # 3. Headless service (pod-level DNS) name: postgres-hl clusterIP: None # postgres-0.postgres-hl.ns.svc.cluster.local

6. Patroni — HA Manager with DCS Consensus

🗳️

DCS (Distributed Config Store)

etcd / Consul / ZooKeeper

Patroni uses a DCS as the source of truth. The leader writes a heartbeat key to DCS every TTL seconds. If the key expires, leader election happens.

🏆

Leader Lock

etcd key with TTL

The primary holds a distributed lock in etcd. If the primary fails to renew it (crash, network partition), the lock expires and another node races to acquire it.

🔄

Failover

Auto + manual

Patroni promotes the standby with highest LSN (least data loss). Reconfigures remaining standbys to follow the new primary automatically.

🌐

REST API

:8008/health

Every Patroni node exposes HTTP API. HAProxy / K8s liveness probes use /primary and /replica endpoints to route traffic correctly.

etcd Cluster (DCS) Leader lock key + cluster config + member state Primary (Zone A) PostgreSQL (read/write) pg_wal/ → WAL sender Patroni :8008 (holds lock) Standby 1 (Zone B) PostgreSQL (read-only) WAL receiver ← streams Patroni :8008 (watches lock) Standby 2 (Zone C) PostgreSQL (read-only) WAL receiver ← streams Patroni :8008 (watches lock) renew lock TTL watch lock WAL HAProxy / K8s Service routes writes → primary, reads → standbys health check: GET :8008/primary
# patroni.yml (simplified) scope: pg-cluster name: postgres-0 etcd3: hosts: etcd-0:2379,etcd-1:2379,etcd-2:2379 bootstrap: dcs: ttl: 30 # leader lock TTL in seconds loop_wait: 10 # Patroni heartbeat interval retry_timeout: 10 maximum_lag_on_failover: 1048576 # 1MB — skip lagging standbys synchronous_mode: true # enforce sync standby postgresql: parameters: wal_level: logical max_wal_senders: 5 synchronous_standby_names: 'ANY 1 (*)'

7. CloudNativePG — Kubernetes-Native Operator

☸️

CRD-Driven

Cluster + Backup + ScheduledBackup

Single Cluster CRD manages the entire PostgreSQL cluster: pods, services, PVCs, config, backups, failover.

🚫

No Patroni / etcd Needed

Uses K8s API as DCS

Uses Kubernetes leases (not etcd) for leader election. The K8s API server is the source of truth. Simpler dependency stack.

💾

Continuous Backup (WAL archiving)

S3 / GCS / Azure

Barman Cloud archives WAL to object storage continuously. Point-in-time recovery (PITR) to any second. New standbys restored from object storage.

🔄

Declarative Rolling Updates

Zero-downtime upgrades

Upgrading PG minor version: operator restarts standbys first (with new image), then triggers a switchover so primary becomes standby, then restarts it. Zero writes blocked.

# CloudNativePG Cluster manifest apiVersion: postgresql.cnpg.io/v1 kind: Cluster metadata: name: pg-cluster spec: instances: 3 # 1 primary + 2 standbys imageName: ghcr.io/cloudnative-pg/postgresql:16 postgresql: parameters: shared_buffers: 256MB max_connections: 200 storage: size: 100Gi storageClass: premium-ssd backup: barmanObjectStore: destinationPath: s3://my-bucket/pg-backups s3Credentials: accessKeyId: { name: aws-creds, key: ACCESS_KEY_ID } topologySpreadConstraints: - maxSkew: 1 topologyKey: topology.kubernetes.io/zone whenUnsatisfiable: DoNotSchedule
CloudNativePG generates 3 Services automatically: pg-cluster-rw (primary), pg-cluster-ro (standbys), pg-cluster-r (all instances). No manual Service management needed.

8. PgBouncer — Connection Pooling (Critical for K8s)

Why connection pooling is mandatory in K8s: PostgreSQL forks a process per connection. With hundreds of microservice pods each holding a connection pool, you can easily hit max_connections (default 100). PgBouncer multiplexes many app connections onto a few server connections.
🔀

Pool Modes

Session — 1 server conn per client session. Like no pooling. Safe for all queries.


Transaction — server conn returned to pool after each transaction. Best throughput. Breaks SET, LISTEN, prepared statements.


Statement — returned after each statement. Aggressive. Breaks multi-statement transactions. Rarely used.

# pgbouncer.ini [databases] app = host=postgres-rw port=5432 dbname=app app_ro = host=postgres-ro port=5432 dbname=app [pgbouncer] pool_mode = transaction # recommended max_client_conn = 1000 # max app connections default_pool_size = 20 # server conns per db+user min_pool_size = 5 reserve_pool_size = 5 server_idle_timeout = 600 # close idle server conns # In K8s: run as sidecar or DaemonSet # Apps connect to pgbouncer:5432, not postgres directly

9. Kubernetes Patterns for PostgreSQL

StatefulSet vs Operator

📦

StatefulSet alone

Gives stable pod names + PVCs. But YOU must handle failover, config changes, backup, and replica reconfiguration. Very manual.

☸️

Operator (CloudNativePG / Patroni)

Operator manages StatefulSets internally + handles: failover, Service updates, backup scheduling, WAL archiving, config reload, switchover. Use this.

Storage Considerations

💾

Use local NVMe or premium SSD

WAL writes are sequential and latency-sensitive. Network-attached storage (EBS gp2, NFS) adds milliseconds per fsync. Use gp3 with provisioned IOPS or local SSDs.

Never use ReadWriteMany (NFS) for PG data

RWX volumes have no fsync guarantees and can cause corruption. Always use ReadWriteOnce PVCs with one pod per PVC.

📤

WAL volume separate from data volume

Put pg_wal on a separate PVC. Isolates WAL I/O from heap I/O, prevents WAL filling data disk.

# Separate WAL volume in StatefulSet volumeClaimTemplates volumeClaimTemplates: - metadata: { name: data } spec: accessModes: [ReadWriteOnce] storageClassName: premium-ssd resources: { requests: { storage: 500Gi } } - metadata: { name: wal } spec: accessModes: [ReadWriteOnce] storageClassName: nvme-local # fast local for WAL resources: { requests: { storage: 50Gi } }

10. Patroni Failover — Step-by-Step

1

Primary pod crashes / OOMKilled / Zone A goes down

postgres-0 stops. Patroni process on postgres-0 is gone. The etcd leader lock TTL countdown begins.

2

etcd TTL expires (default 30s)

The leader key in etcd expires. Patroni on standby nodes detects the lock is gone and races to acquire it.

3

Candidate standbys check eligibility

Each standby reports its LSN (log position) to etcd. Standbys with lag > maximum_lag_on_failover are excluded. Synchronous standby is preferred.

4

Leader election — one standby wins etcd lock

The standby with the best LSN (or the sync standby) acquires the etcd lock. Other standbys back off.

5

Winning standby runs pg_ctl promote

PostgreSQL exits recovery mode, creates a recovery.signal removal, generates a new timeline, and starts accepting writes.

6

Other standbys reconfigured

Patroni updates primary_conninfo on remaining standbys to point to the new primary. They reconnect and continue streaming WAL from the new timeline.

7

K8s Service updated

Operator updates postgres-rw Service selector to point to the new primary pod. New writes are immediately routed to the promoted pod.

8

PgBouncer reconnects

PgBouncer detects the old primary is gone and reconnects to the new host (via Service DNS). In-flight transactions fail with a connection error — apps must retry.

9

Old primary joins as standby (if it recovers)

When postgres-0 restarts, Patroni detects a newer timeline exists, runs pg_rewind to rewind to the failover point, then reattaches as standby.

Total failover time (Patroni): TTL (30s) + election + promote ≈ 30–45s. CloudNativePG is typically 5–15s (uses K8s lease with shorter TTL and watchdog). With synchronous standby, zero data loss. With async, potential loss of transactions committed in the last few seconds.
# Trigger manual switchover (graceful, zero data loss) patronictl -c /etc/patroni.yml switchover --master postgres-0 --candidate postgres-1 # Force failover (if master unreachable) patronictl -c /etc/patroni.yml failover pg-cluster --candidate postgres-1 --force # Check cluster state patronictl -c /etc/patroni.yml list # CloudNativePG promote a standby kubectl annotate cluster pg-cluster cnpg.io/failoverDelay="0"

11. Operator & Tool Comparison

Tool Type DCS Failover Time Backup K8s-Native Logical Repl Best For
CloudNativePG Operator CRD K8s Leases 5–15s Barman (S3/GCS) Yes Yes New K8s deployments, CNCF-aligned
Patroni + etcd HA Manager etcd / Consul 30–45s pgBackRest / Barman Partial Yes Existing infra, more control
Crunchy PGO Operator CRD K8s Leases 5–20s pgBackRest Yes Yes Enterprise + OpenShift
Bitnami Helm Helm Chart None (manual) Manual Manual / CronJob No Manual Quick dev/test setup only
Zalando Operator Operator CRD etcd (Patroni) 30–45s WAL-E / Spilo Yes Yes Large teams using Patroni base

12. Production Readiness Checklist

Replication & HA

3 instances across 3 zones

Primary in Zone A, sync standby in Zone B, async standby in Zone C.

Synchronous standby configured

synchronous_standby_names = 'ANY 1 (*)' for zero data loss with fallback.

Replication slots monitored

Alert if pg_replication_slots.active = false or WAL retained > 10GB.

Failover tested in staging

Run patronictl switchover monthly. Verify RTO and RPO meet SLA.

PgBouncer in transaction mode

Apps connect to PgBouncer, not Postgres directly. Max connections never hits limit.

Backup & Observability

Continuous WAL archiving to S3

WAL archived every segment (16MB). Enables PITR. Barman Cloud or pgBackRest.

Daily base backup

ScheduledBackup CRD (CloudNativePG) or pg_basebackup via CronJob.

Restore tested monthly

Restore latest backup to a test namespace. Verify row counts and schema integrity.

Alerts on replication lag

Alert if standby lag > 30s (warning) or > 5min (critical). Use pg_stat_replication.

PodDisruptionBudget

minAvailable: 2 — prevents draining more than 1 PG node during rolling node upgrades.

-- Key monitoring queries -- Replication lag on primary SELECT application_name, state, sync_state, write_lag, flush_lag, replay_lag FROM pg_stat_replication; -- Long-running queries (lock contention) SELECT pid, now() - query_start AS duration, state, query FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '5 minutes' ORDER BY duration DESC; -- Bloat / dead tuples per table SELECT relname, n_dead_tup, n_live_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10; -- Connection count by state SELECT state, count(*) FROM pg_stat_activity GROUP BY state;