PostgreSQL HA in Kubernetes — Replication, Failover & Multi-Zone

1. PostgreSQL HA Deployment Modes

🖥️

Standalone

Single pod, single PVC. No HA. Dev/test only.

🔁

Streaming Replication

Primary → Standby via WAL stream. Manual failover without a manager.

📋

Logical Replication

Replicate specific tables/databases. Supports cross-version and partial sync.

🛡️

Patroni / CloudNativePG

Automated failover with DCS (etcd/consul) consensus. Production standard.

PostgreSQL vs Redis replication: PostgreSQL replication is WAL-based — every change is written to the Write-Ahead Log before it's applied. Standbys replay WAL segments. This means standby always has a consistent, crash-safe copy (unlike Redis async replication which can lose buffered commands).

2. WAL — Write-Ahead Log (The Core of PG Replication)

Why WAL makes PG replication safer than Redis: PostgreSQL commits only after WAL is fsynced to disk. The standby replays the same WAL records. Even if the primary crashes mid-transaction, the standby has a consistent view. Redis acks writes before replication — a crash between ack and replication means data loss.

-- Key WAL configuration (postgresql.conf)
wal_level = replica         -- minimum for streaming replication
max_wal_senders = 5         -- max concurrent WAL sender connections
wal_keep_size = 512MB       -- keep WAL available for slow replicas
checkpoint_timeout = 5min  -- flush dirty pages every 5min
synchronous_commit = on    -- default: wait for local WAL fsync
-- for sync replication: wait for at least 1 standby
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
  

3. Streaming Replication — WAL in Real Time

Async (default)

Primary writes WAL + fsyncs

Transaction commits after local WAL fsync. Standby receives WAL asynchronously.

ACK sent immediately to client

Client doesn't wait for standby to confirm. Highest throughput, small risk of data loss on failover.

⚠

Replication lag possible

Standby can be seconds behind under high write load. Monitor with pg_stat_replication.write_lag.

Synchronous

Primary writes WAL + waits

Transaction pauses until the designated sync standby confirms WAL written/flushed.

Standby confirms (remote_write / remote_apply)

remote_write = standby received in OS buffer. remote_apply = standby applied the changes.

Zero data loss on failover

Promoted standby has all committed transactions. Tradeoff: write latency includes network RTT to standby.

Synchronous replication footgun: If all sync standbys go down, writes hang indefinitely by default. Set synchronous_standby_names = 'ANY 1 (s1, s2)' so the primary falls back to async if all standbys disconnect.

What is a replication slot?

A persistent cursor on the primary. The primary keeps WAL segments until all consumers of a slot have consumed them. Survives primary restarts.

✓

Why use slots?

Without a slot, a lagging standby might miss WAL that gets recycled. Slots guarantee the standby can always catch up even after downtime.

⚠

Danger: unbounded WAL accumulation

If a replica with a slot goes offline for a long time, the primary accumulates WAL forever → disk full → primary crashes. Always monitor pg_replication_slots.active and set max_slot_wal_keep_size.

-- Create a physical replication slot (streaming replication)
SELECT * FROM pg_create_physical_replication_slot('standby1_slot');

-- Monitor slots
SELECT slot_name, active, wal_status, safe_wal_size, restart_lsn
FROM pg_replication_slots;

-- Cap how much WAL a slot can hold back (PG 13+)
max_slot_wal_keep_size = 2GB  -- slot invalidated if it falls behind more than this
    

📖

Hot Standby

Read queries on standby

Set hot_standby = on (default). Standby serves read-only SELECT queries while replaying WAL.

⏳

Read-Your-Writes Problem

Async lag caveat

A client writes to primary, immediately reads from standby — may see stale data due to replication lag. Solve with recovery_min_apply_delay = 0 or sticky sessions.

⚖️

Load Balancing Reads

PgBouncer / HAProxy

Route reads to standbys via PgBouncer pool or HAProxy. K8s Service with app: pg-standby label. Spread read load across all replicas.

🔒

Standby Conflicts

Vacuum vs long queries

Vacuum on primary removes dead tuples that a standby read query still needs → conflict. Tune max_standby_streaming_delay = 30s before canceling standby queries.

-- On standby: check lag in bytes and seconds
SELECT
  pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS receive_lag_bytes,
  now() - pg_last_xact_replay_timestamp() AS replay_lag_secs;

-- On primary: check all standby states
SELECT application_name, state, sync_state, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
    

4. Logical Replication — Table-Level, Cross-Version

📋

Publish / Subscribe Model

PG 10+

Publisher creates a publication for specific tables. Subscriber creates a subscription pointing to the publisher. Row-level changes (INSERT/UPDATE/DELETE) are decoded from WAL and sent.

-- On publisher (source):
CREATE PUBLICATION my_pub
  FOR TABLE orders, products;

-- On subscriber (target):
CREATE SUBSCRIPTION my_sub
  CONNECTION 'host=primary port=5432 dbname=app user=repl'
  PUBLICATION my_pub;

-- Monitor replication progress
SELECT * FROM pg_stat_subscription;
      

When to Use Logical vs Streaming

Cross-version migration

Upgrade PG 14 → 17 with zero downtime. Logical replication works between major versions (streaming does not).

Partial dataset replication

Only replicate specific tables or even filtered rows to a reporting replica or separate microservice database.

Multi-master writes (with care)

Two clusters can be publisher and subscriber to each other — but you must handle conflict resolution manually.

Full HA failover → use Streaming

Streaming replication is simpler for full standby promotion. Logical is an add-on, not a replacement for streaming HA.

5. Multi-Zone Layout in Kubernetes

ZONE A — Primary

🔵 postgres-0 PRIMARY

🟣 pgbouncer-0 POOL

🟢 etcd-0 DCS

ZONE B — Sync Standby

🟠 postgres-1 SYNC STANDBY

🟣 pgbouncer-1 POOL

🟢 etcd-1 DCS

ZONE C — Async Standby

🟠 postgres-2 ASYNC STANDBY

🟣 pgbouncer-2 POOL

🟢 etcd-2 DCS

Why one sync + one async standby? Zone B (sync) guarantees zero data loss on failover — it always has every committed transaction. Zone C (async) is cheaper and used for read scaling + DR without adding write latency. If Zone A fails, Patroni promotes Zone B instantly.

# Spread PG pods across zones (CloudNativePG)
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      cnpg.io/cluster: pg-cluster

# PodAntiAffinity: primary and sync standby
# MUST be in different zones
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: topology.kubernetes.io/zone
      

# K8s Services needed for PG HA

# 1. Read-Write service (primary only)
name: postgres-rw
selector: { role: primary }
# Updated by operator on failover

# 2. Read-Only service (all standbys)
name: postgres-ro
selector: { role: replica }

# 3. Headless service (pod-level DNS)
name: postgres-hl
clusterIP: None
# postgres-0.postgres-hl.ns.svc.cluster.local
      

6. Patroni — HA Manager with DCS Consensus

🗳️

DCS (Distributed Config Store)

etcd / Consul / ZooKeeper

Patroni uses a DCS as the source of truth. The leader writes a heartbeat key to DCS every TTL seconds. If the key expires, leader election happens.

🏆

Leader Lock

etcd key with TTL

The primary holds a distributed lock in etcd. If the primary fails to renew it (crash, network partition), the lock expires and another node races to acquire it.

🔄

Failover

Auto + manual

Patroni promotes the standby with highest LSN (least data loss). Reconfigures remaining standbys to follow the new primary automatically.

🌐

REST API

:8008/health

Every Patroni node exposes HTTP API. HAProxy / K8s liveness probes use /primary and /replica endpoints to route traffic correctly.

# patroni.yml (simplified)
scope: pg-cluster
name: postgres-0

etcd3:
  hosts: etcd-0:2379,etcd-1:2379,etcd-2:2379

bootstrap:
  dcs:
    ttl: 30                      # leader lock TTL in seconds
    loop_wait: 10                 # Patroni heartbeat interval
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB — skip lagging standbys
    synchronous_mode: true       # enforce sync standby

postgresql:
  parameters:
    wal_level: logical
    max_wal_senders: 5
    synchronous_standby_names: 'ANY 1 (*)'
  

7. CloudNativePG — Kubernetes-Native Operator

☸️

CRD-Driven

Cluster + Backup + ScheduledBackup

Single Cluster CRD manages the entire PostgreSQL cluster: pods, services, PVCs, config, backups, failover.

🚫

No Patroni / etcd Needed

Uses K8s API as DCS

Uses Kubernetes leases (not etcd) for leader election. The K8s API server is the source of truth. Simpler dependency stack.

💾

Continuous Backup (WAL archiving)

S3 / GCS / Azure

Barman Cloud archives WAL to object storage continuously. Point-in-time recovery (PITR) to any second. New standbys restored from object storage.

🔄

Declarative Rolling Updates

Zero-downtime upgrades

Upgrading PG minor version: operator restarts standbys first (with new image), then triggers a switchover so primary becomes standby, then restarts it. Zero writes blocked.

# CloudNativePG Cluster manifest
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-cluster
spec:
  instances: 3             # 1 primary + 2 standbys
  imageName: ghcr.io/cloudnative-pg/postgresql:16

  postgresql:
    parameters:
      shared_buffers: 256MB
      max_connections: 200

  storage:
    size: 100Gi
    storageClass: premium-ssd

  backup:
    barmanObjectStore:
      destinationPath: s3://my-bucket/pg-backups
      s3Credentials:
        accessKeyId: { name: aws-creds, key: ACCESS_KEY_ID }

  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
  

CloudNativePG generates 3 Services automatically: pg-cluster-rw (primary), pg-cluster-ro (standbys), pg-cluster-r (all instances). No manual Service management needed.

8. PgBouncer — Connection Pooling (Critical for K8s)

Why connection pooling is mandatory in K8s: PostgreSQL forks a process per connection. With hundreds of microservice pods each holding a connection pool, you can easily hit max_connections (default 100). PgBouncer multiplexes many app connections onto a few server connections.

🔀

Pool Modes

Session — 1 server conn per client session. Like no pooling. Safe for all queries.

Transaction — server conn returned to pool after each transaction. Best throughput. Breaks SET, LISTEN, prepared statements.

Statement — returned after each statement. Aggressive. Breaks multi-statement transactions. Rarely used.

# pgbouncer.ini
[databases]
app = host=postgres-rw port=5432 dbname=app
app_ro = host=postgres-ro port=5432 dbname=app

[pgbouncer]
pool_mode = transaction       # recommended
max_client_conn = 1000        # max app connections
default_pool_size = 20        # server conns per db+user
min_pool_size = 5
reserve_pool_size = 5
server_idle_timeout = 600     # close idle server conns

# In K8s: run as sidecar or DaemonSet
# Apps connect to pgbouncer:5432, not postgres directly
      

9. Kubernetes Patterns for PostgreSQL

StatefulSet vs Operator

📦

StatefulSet alone

Gives stable pod names + PVCs. But YOU must handle failover, config changes, backup, and replica reconfiguration. Very manual.

☸️

Operator (CloudNativePG / Patroni)

Operator manages StatefulSets internally + handles: failover, Service updates, backup scheduling, WAL archiving, config reload, switchover. Use this.

Storage Considerations

💾

Use local NVMe or premium SSD

WAL writes are sequential and latency-sensitive. Network-attached storage (EBS gp2, NFS) adds milliseconds per fsync. Use gp3 with provisioned IOPS or local SSDs.

⚠

Never use ReadWriteMany (NFS) for PG data

RWX volumes have no fsync guarantees and can cause corruption. Always use ReadWriteOnce PVCs with one pod per PVC.

📤

WAL volume separate from data volume

Put pg_wal on a separate PVC. Isolates WAL I/O from heap I/O, prevents WAL filling data disk.

# Separate WAL volume in StatefulSet volumeClaimTemplates
volumeClaimTemplates:
- metadata: { name: data }
  spec:
    accessModes: [ReadWriteOnce]
    storageClassName: premium-ssd
    resources: { requests: { storage: 500Gi } }
- metadata: { name: wal }
  spec:
    accessModes: [ReadWriteOnce]
    storageClassName: nvme-local  # fast local for WAL
    resources: { requests: { storage: 50Gi } }
  

10. Patroni Failover — Step-by-Step

Primary pod crashes / OOMKilled / Zone A goes down

postgres-0 stops. Patroni process on postgres-0 is gone. The etcd leader lock TTL countdown begins.

etcd TTL expires (default 30s)

The leader key in etcd expires. Patroni on standby nodes detects the lock is gone and races to acquire it.

Candidate standbys check eligibility

Each standby reports its LSN (log position) to etcd. Standbys with lag > maximum_lag_on_failover are excluded. Synchronous standby is preferred.

Leader election — one standby wins etcd lock

The standby with the best LSN (or the sync standby) acquires the etcd lock. Other standbys back off.

Winning standby runs pg_ctl promote

PostgreSQL exits recovery mode, creates a recovery.signal removal, generates a new timeline, and starts accepting writes.

Other standbys reconfigured

Patroni updates primary_conninfo on remaining standbys to point to the new primary. They reconnect and continue streaming WAL from the new timeline.

K8s Service updated

Operator updates postgres-rw Service selector to point to the new primary pod. New writes are immediately routed to the promoted pod.

PgBouncer reconnects

PgBouncer detects the old primary is gone and reconnects to the new host (via Service DNS). In-flight transactions fail with a connection error — apps must retry.

Old primary joins as standby (if it recovers)

When postgres-0 restarts, Patroni detects a newer timeline exists, runs pg_rewind to rewind to the failover point, then reattaches as standby.

Total failover time (Patroni): TTL (30s) + election + promote ≈ 30–45s. CloudNativePG is typically 5–15s (uses K8s lease with shorter TTL and watchdog). With synchronous standby, zero data loss. With async, potential loss of transactions committed in the last few seconds.

# Trigger manual switchover (graceful, zero data loss)
patronictl -c /etc/patroni.yml switchover --master postgres-0 --candidate postgres-1

# Force failover (if master unreachable)
patronictl -c /etc/patroni.yml failover pg-cluster --candidate postgres-1 --force

# Check cluster state
patronictl -c /etc/patroni.yml list

# CloudNativePG promote a standby
kubectl annotate cluster pg-cluster cnpg.io/failoverDelay="0"
  

11. Operator & Tool Comparison

Tool	Type	DCS	Failover Time	Backup	K8s-Native	Logical Repl	Best For
CloudNativePG	Operator CRD	K8s Leases	5–15s	Barman (S3/GCS)	Yes	Yes	New K8s deployments, CNCF-aligned
Patroni + etcd	HA Manager	etcd / Consul	30–45s	pgBackRest / Barman	Partial	Yes	Existing infra, more control
Crunchy PGO	Operator CRD	K8s Leases	5–20s	pgBackRest	Yes	Yes	Enterprise + OpenShift
Bitnami Helm	Helm Chart	None (manual)	Manual	Manual / CronJob	No	Manual	Quick dev/test setup only
Zalando Operator	Operator CRD	etcd (Patroni)	30–45s	WAL-E / Spilo	Yes	Yes	Large teams using Patroni base

12. Production Readiness Checklist

Replication & HA

✓

3 instances across 3 zones

Primary in Zone A, sync standby in Zone B, async standby in Zone C.

✓

Synchronous standby configured

synchronous_standby_names = 'ANY 1 (*)' for zero data loss with fallback.

✓

Replication slots monitored

Alert if pg_replication_slots.active = false or WAL retained > 10GB.

✓

Failover tested in staging

Run patronictl switchover monthly. Verify RTO and RPO meet SLA.

✓

PgBouncer in transaction mode

Apps connect to PgBouncer, not Postgres directly. Max connections never hits limit.

Backup & Observability

✓

Continuous WAL archiving to S3

WAL archived every segment (16MB). Enables PITR. Barman Cloud or pgBackRest.

✓

Daily base backup

ScheduledBackup CRD (CloudNativePG) or pg_basebackup via CronJob.

✓

Restore tested monthly

Restore latest backup to a test namespace. Verify row counts and schema integrity.

✓

Alerts on replication lag

Alert if standby lag > 30s (warning) or > 5min (critical). Use pg_stat_replication.

✓

PodDisruptionBudget

minAvailable: 2 — prevents draining more than 1 PG node during rolling node upgrades.

-- Key monitoring queries

-- Replication lag on primary
SELECT application_name, state, sync_state,
  write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- Long-running queries (lock contention)
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle' AND query_start < now() - interval '5 minutes'
ORDER BY duration DESC;

-- Bloat / dead tuples per table
SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;

-- Connection count by state
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;