1. PostgreSQL HA Deployment Modes
Standalone
Single pod, single PVC. No HA. Dev/test only.
Streaming Replication
Primary → Standby via WAL stream. Manual failover without a manager.
Logical Replication
Replicate specific tables/databases. Supports cross-version and partial sync.
Patroni / CloudNativePG
Automated failover coordinated through a consensus store: a DCS such as etcd or Consul for Patroni, the Kubernetes API for CloudNativePG. Production standard.
2. WAL — Write-Ahead Log (The Core of PG Replication)
3. Streaming Replication — WAL in Real Time
Async (default)
Primary writes WAL + fsyncs
Transaction commits after local WAL fsync. Standby receives WAL asynchronously.
ACK sent immediately to client
Client doesn't wait for standby to confirm. Highest throughput, small risk of data loss on failover.
Replication lag possible
Standby can be seconds behind under high write load. Monitor with pg_stat_replication.write_lag.
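Lag can also be expressed in bytes by comparing WAL positions. The sketch below assumes you have already fetched two LSN strings (for example the primary's current position and a standby's replay position); the function names are illustrative, not part of any library.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string such as '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    # pg_lsn is printed as two hex numbers: the high and low 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby is behind the primary (never negative)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(standby_lsn))
```

A standby at 0/4000000 while the primary is at 0/5000000 is exactly one 16MB WAL segment behind.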
Synchronous
Primary writes WAL + waits
Transaction pauses until the designated sync standby confirms WAL written/flushed.
Standby confirms (remote_write / remote_apply)
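remote_write = standby has written the WAL to its OS buffer but not yet flushed it to disk. on = flushed to disk on the standby. remote_apply = standby has applied the changes, so they are visible to its queries.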
Zero data loss on failover
Promoted standby has all committed transactions. Tradeoff: write latency includes network RTT to standby.
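synchronous_standby_names = 'ANY 1 (s1, s2)' lets either standby confirm a commit, so losing one standby does not stall writes. If every listed standby disconnects, commits block until one returns; PostgreSQL does not fall back to async by itself. A manager such as Patroni (synchronous_mode) can relax the setting automatically.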
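The quorum form of synchronous_standby_names can be sketched as a small model. This is an illustration of the rule only, assuming hypothetical helper names; PostgreSQL evaluates the quorum internally.

```python
def quorum_sync_names(num_sync: int, standbys: list[str]) -> str:
    """Render a quorum-style synchronous_standby_names value."""
    if not 1 <= num_sync <= len(standbys):
        raise ValueError("num_sync must be between 1 and the number of standbys")
    return f"ANY {num_sync} ({', '.join(standbys)})"


def commit_acknowledged(num_sync: int, standbys: list[str], confirmed: set[str]) -> bool:
    """True once at least num_sync of the listed standbys have confirmed the WAL."""
    return len(confirmed & set(standbys)) >= num_sync
```

With quorum_sync_names(1, ["s1", "s2"]), a confirmation from either s1 or s2 is enough for the primary to acknowledge the commit.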
What is a replication slot?
A persistent cursor on the primary. The primary keeps WAL segments until the slot's consumer has confirmed receiving them. Survives primary restarts.
Why use slots?
Without a slot, a lagging standby might miss WAL that gets recycled. Slots guarantee the standby can always catch up even after downtime.
Danger: unbounded WAL accumulation
If a replica with a slot goes offline for a long time, the primary accumulates WAL forever → disk full → primary crashes. Always monitor pg_replication_slots.active and set max_slot_wal_keep_size.
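A slot check might look like the sketch below. The query uses the pg_replication_slots view with pg_wal_lsn_diff and pg_current_wal_lsn; the 10GB threshold and the row format (a list of dicts from whatever driver you use) are assumptions for illustration.

```python
# Run on the primary; returns one row per slot with the WAL it still pins.
SLOT_QUERY = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""


def risky_slots(rows, max_retained_bytes=10 * 1024**3):
    """Return names of slots that are inactive or retain more WAL than allowed."""
    flagged = []
    for row in rows:
        retained = row["retained_bytes"] or 0  # NULL when the slot has no restart_lsn yet
        if not row["active"] or retained > max_retained_bytes:
            flagged.append(row["slot_name"])
    return flagged
```

Feeding the query result into risky_slots gives a list you can alert on before the disk fills.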
Hot Standby
Set hot_standby = on (default). Standby serves read-only SELECT queries while replaying WAL.
Read-Your-Writes Problem
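A client writes to the primary and immediately reads from a standby, so it may see stale data due to replication lag. Mitigate with synchronous_commit = remote_apply, by routing a session's reads to the primary right after it writes (sticky sessions), or by waiting until the standby has replayed the write's LSN. recovery_min_apply_delay = 0 (the default) only avoids adding artificial delay; it does not remove lag.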
Load Balancing Reads
Route reads to standbys through HAProxy or a PgBouncer pool that targets them. In Kubernetes, a Service selecting the app: pg-standby label spreads read load across all replicas.
Standby Conflicts
VACUUM on the primary removes dead tuples that a standby read query still needs, which causes a recovery conflict. max_standby_streaming_delay (default 30s) sets how long WAL replay waits before canceling the conflicting standby query.
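One way to handle the read-your-writes problem in application code is an LSN check: remember the primary's WAL position after a session's last write (pg_current_wal_lsn()) and compare it with the standby's pg_last_wal_replay_lsn() before reading there. The routing function below is a hypothetical sketch of that comparison, not part of any pooler.

```python
from typing import Optional


def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string such as '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def choose_read_target(last_write_lsn: Optional[str], standby_replay_lsn: str) -> str:
    """Return 'standby' only if it has replayed the session's last write."""
    if last_write_lsn is None:
        return "standby"  # the session has not written anything yet
    if lsn_to_int(standby_replay_lsn) >= lsn_to_int(last_write_lsn):
        return "standby"
    return "primary"  # standby is still behind this session's write
```

Sessions that have just written fall back to the primary until the standby catches up, and every other read stays on the replicas.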
4. Logical Replication — Table-Level, Cross-Version
Publish / Subscribe Model
Publisher creates a publication for specific tables. Subscriber creates a subscription pointing to the publisher. Row-level changes (INSERT/UPDATE/DELETE) are decoded from WAL and sent.
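The publish/subscribe setup comes down to two SQL statements, one per side. The helper below is a minimal sketch that only builds those statements; the function names are made up for illustration, and the quoting assumes unqualified table names.

```python
def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_publication_sql(pub_name: str, tables: list[str]) -> str:
    """Statement to run on the publisher."""
    table_list = ", ".join(quote_ident(t) for t in tables)
    return f"CREATE PUBLICATION {quote_ident(pub_name)} FOR TABLE {table_list};"


def create_subscription_sql(sub_name: str, conninfo: str, pub_name: str) -> str:
    """Statement to run on the subscriber; conninfo points at the publisher."""
    escaped = conninfo.replace("'", "''")  # escape single quotes inside the literal
    return (
        f"CREATE SUBSCRIPTION {quote_ident(sub_name)} "
        f"CONNECTION '{escaped}' PUBLICATION {quote_ident(pub_name)};"
    )
```

The subscriber needs matching table definitions before the subscription is created, since logical replication ships row changes, not schema.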
When to Use Logical vs Streaming
Cross-version migration
Upgrade PG 14 → 17 with near-zero downtime (a brief cutover to switch writers). Logical replication works between major versions (streaming does not).
Partial dataset replication
Only replicate specific tables or even filtered rows to a reporting replica or separate microservice database.
Multi-master writes (with care)
Two clusters can be publisher and subscriber to each other — but you must handle conflict resolution manually.
Full HA failover → use Streaming
Streaming replication is simpler for full standby promotion. Logical is an add-on, not a replacement for streaming HA.
5. Multi-Zone Layout in Kubernetes
6. Patroni — HA Manager with DCS Consensus
DCS (Distributed Config Store)
Patroni uses a DCS as the source of truth. The leader renews its key in the DCS every loop_wait seconds (default 10); the key carries a TTL (default 30s). If the key expires, a leader election happens.
Leader Lock
The primary holds a distributed lock in etcd. If the primary fails to renew it (crash, network partition), the lock expires and another node races to acquire it.
Failover
Patroni promotes the standby with highest LSN (least data loss). Reconfigures remaining standbys to follow the new primary automatically.
REST API
Every Patroni node exposes an HTTP API. HAProxy health checks and K8s readiness probes use the /primary and /replica endpoints to route traffic to the right node.
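The leader lock behaviour described above can be illustrated with an in-memory stand-in for the DCS key. This is a toy model under simplifying assumptions (a single process, caller-supplied timestamps), not Patroni's code or an etcd client.

```python
from typing import Optional


class LeaderLock:
    """In-memory stand-in for a DCS leader key with a TTL."""

    def __init__(self, ttl: float = 30.0):
        self.ttl = ttl
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def try_acquire(self, node: str, now: float) -> bool:
        """Take or renew the lock; fails while another node holds an unexpired lock."""
        held_by_other = self.holder not in (None, node)
        if held_by_other and now < self.expires_at:
            return False
        self.holder = node
        self.expires_at = now + self.ttl  # each successful call restarts the TTL
        return True
```

As long as the current holder keeps calling try_acquire before the TTL runs out, no other node can take over; once it stops, the first caller after expiry becomes the leader.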
7. CloudNativePG — Kubernetes-Native Operator
CRD-Driven
Single Cluster CRD manages the entire PostgreSQL cluster: pods, services, PVCs, config, backups, failover.
No Patroni / etcd Needed
Coordinates through the Kubernetes API server, which is the source of truth, instead of a separate etcd or Consul cluster. Simpler dependency stack.
Continuous Backup (WAL archiving)
Barman Cloud archives WAL to object storage continuously. Point-in-time recovery (PITR) to any second. New standbys restored from object storage.
Declarative Rolling Updates
Upgrading PG minor version: the operator restarts standbys first (with the new image), then triggers a switchover so the primary becomes a standby, then restarts it. Writes are interrupted only briefly during the switchover.
The operator creates three Services: pg-cluster-rw (primary), pg-cluster-ro (standbys), pg-cluster-r (all instances). No manual Service management needed.
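An application can pick the right Service from the kind of work it is about to do. The mapping below is a small sketch; the workload labels and the default namespace are assumptions for illustration.

```python
# Service suffixes as listed above: -rw (primary), -ro (standbys), -r (all instances).
SERVICE_SUFFIX = {
    "read-write": "rw",
    "read-only": "ro",
    "read-any": "r",
}


def service_host(cluster: str, workload: str, namespace: str = "default") -> str:
    """Build the in-cluster DNS name of the Service for a given workload type."""
    return f"{cluster}-{SERVICE_SUFFIX[workload]}.{namespace}.svc"
```

For example, service_host("pg-cluster", "read-only") yields the host that reporting queries would connect to.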
8. PgBouncer — Connection Pooling (Critical for K8s)
Each PostgreSQL connection is a separate backend process, capped by max_connections (default 100). PgBouncer multiplexes many app connections onto a few server connections.
Pool Modes
Session: 1 server conn per client session. Like no pooling. Safe for all queries.
Transaction: server conn returned to pool after each transaction. Best throughput. Breaks session state such as SET and LISTEN, and prepared statements unless PgBouncer 1.21+ is configured with max_prepared_statements.
Statement: returned after each statement. Aggressive. Breaks multi-statement transactions. Rarely used.
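The pool-mode trade-offs can be turned into a quick compatibility check. The feature labels are invented for this sketch, and it assumes statement mode breaks everything transaction mode does plus multi-statement transactions.

```python
# Features each pool mode cannot support, following the list above.
BROKEN_BY_MODE = {
    "session": set(),
    "transaction": {"SET", "LISTEN", "PREPARED_STATEMENTS"},
    # Assumption: statement mode is at least as restrictive as transaction mode.
    "statement": {"SET", "LISTEN", "PREPARED_STATEMENTS", "MULTI_STATEMENT_TX"},
}


def unsafe_features(mode: str, used_features: set[str]) -> list[str]:
    """Return the features an app uses that the given pool mode would break."""
    return sorted(BROKEN_BY_MODE[mode] & used_features)
```

An empty result means the application's feature set is compatible with that mode under these assumptions.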
9. Kubernetes Patterns for PostgreSQL
StatefulSet vs Operator
StatefulSet alone
Gives stable pod names + PVCs. But YOU must handle failover, config changes, backup, and replica reconfiguration. Very manual.
Operator (CloudNativePG, or Patroni-based operators)
The operator manages pods and PVCs for you (CloudNativePG manages them directly rather than through a StatefulSet) and handles failover, Service updates, backup scheduling, WAL archiving, config reload, and switchover. Use this.
Storage Considerations
Use local NVMe or premium SSD
WAL writes are sequential and latency-sensitive. Network-attached storage (EBS gp2, NFS) adds milliseconds per fsync. Prefer local SSDs; if you need network block storage, use gp3 with provisioned IOPS.
Never use ReadWriteMany (NFS) for PG data
Shared RWX filesystems such as NFS often have weak fsync and locking semantics and can cause corruption. Always use ReadWriteOnce PVCs with one pod per PVC.
WAL volume separate from data volume
Put pg_wal on a separate PVC. Isolates WAL I/O from heap I/O, prevents WAL filling data disk.
10. Patroni Failover — Step-by-Step
Primary pod crashes / OOMKilled / Zone A goes down
postgres-0 stops. Patroni process on postgres-0 is gone. The etcd leader lock TTL countdown begins.
etcd TTL expires (default 30s)
The leader key in etcd expires. Patroni on standby nodes detects the lock is gone and races to acquire it.
Candidate standbys check eligibility
Each standby reports its LSN (log position) to etcd. Standbys with lag > maximum_lag_on_failover are excluded. Synchronous standby is preferred.
Leader election — one standby wins etcd lock
The standby with the best LSN (or the sync standby) acquires the etcd lock. Other standbys back off.
Winning standby runs pg_ctl promote
PostgreSQL exits recovery mode, removes standby.signal, switches to a new timeline, and starts accepting writes.
Other standbys reconfigured
Patroni updates primary_conninfo on remaining standbys to point to the new primary. They reconnect and continue streaming WAL from the new timeline.
K8s Service updated
Patroni or the operator updates the role label or endpoint behind the postgres-rw Service so that it points to the new primary pod. New writes are routed to the promoted pod.
PgBouncer reconnects
PgBouncer detects the old primary is gone and reconnects to the new host (via Service DNS). In-flight transactions fail with a connection error — apps must retry.
Old primary joins as standby (if it recovers)
When postgres-0 restarts, Patroni detects that a newer timeline exists, runs pg_rewind to roll back to the point where the timelines diverged, then reattaches it as a standby.
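The eligibility and election steps above can be condensed into a selection rule. The sketch below is a simplified model, not Patroni's implementation: in reality the nodes race for the lock, while here a single function picks the winner, and the Standby type is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Standby:
    name: str
    lsn: int  # last received WAL position, in bytes
    is_sync: bool = False


def pick_candidate(
    standbys: list[Standby], last_leader_lsn: int, maximum_lag_on_failover: int
) -> Optional[str]:
    """Exclude standbys that lag too far, then prefer the sync standby, then the highest LSN."""
    eligible = [
        s for s in standbys
        if last_leader_lsn - s.lsn <= maximum_lag_on_failover
    ]
    if not eligible:
        return None  # nobody is safe to promote automatically
    best = max(eligible, key=lambda s: (s.is_sync, s.lsn))
    return best.name
```

Returning None models the case where every standby exceeds maximum_lag_on_failover and no automatic promotion should happen.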
11. Operator & Tool Comparison
| Tool | Type | DCS | Failover Time | Backup | K8s-Native | Logical Repl | Best For |
|---|---|---|---|---|---|---|---|
| CloudNativePG | Operator CRD | K8s Leases | 5–15s | Barman (S3/GCS) | Yes | Yes | New K8s deployments, CNCF-aligned |
| Patroni + etcd | HA Manager | etcd / Consul | 30–45s | pgBackRest / Barman | Partial | Yes | Existing infra, more control |
| Crunchy PGO | Operator CRD | K8s Leases | 5–20s | pgBackRest | Yes | Yes | Enterprise + OpenShift |
| Bitnami Helm | Helm Chart | None (manual) | Manual | Manual / CronJob | No | Manual | Quick dev/test setup only |
| Zalando Operator | Operator CRD | etcd (Patroni) | 30–45s | WAL-E / WAL-G (Spilo image) | Yes | Yes | Large teams using Patroni base |
12. Production Readiness Checklist
Replication & HA
3 instances across 3 zones
Primary in Zone A, sync standby in Zone B, async standby in Zone C.
Synchronous standby configured
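synchronous_standby_names = 'ANY 1 (*)' so any one connected standby can confirm each commit. Zero data loss, but commits block if no standby is connected unless an HA manager relaxes the setting.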
Replication slots monitored
Alert if pg_replication_slots.active = false or WAL retained > 10GB.
Failover tested in staging
Run patronictl switchover monthly. Verify RTO and RPO meet SLA.
PgBouncer in transaction mode
Apps connect to PgBouncer, not to Postgres directly, so the server stays below max_connections.
Backup & Observability
Continuous WAL archiving to S3
WAL archived every segment (16MB). Enables PITR. Barman Cloud or pgBackRest.
Daily base backup
ScheduledBackup CRD (CloudNativePG) or pg_basebackup via CronJob.
Restore tested monthly
Restore latest backup to a test namespace. Verify row counts and schema integrity.
Alerts on replication lag
Alert if standby lag > 30s (warning) or > 5min (critical). Use pg_stat_replication.
PodDisruptionBudget
minAvailable: 2 — prevents draining more than 1 PG node during rolling node upgrades.
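The replication lag thresholds from the checklist (30s warning, 5min critical) map onto a simple classifier. This is a sketch of the alert logic only; how the lag value is collected (for example from pg_stat_replication) is left to your monitoring stack, and the function name is illustrative.

```python
def lag_severity(
    lag_seconds: float,
    warning_after: float = 30.0,
    critical_after: float = 300.0,
) -> str:
    """Classify standby lag against the warning and critical thresholds."""
    if lag_seconds > critical_after:
        return "critical"
    if lag_seconds > warning_after:
        return "warning"
    return "ok"
```

Keeping the thresholds as parameters makes it easy to tighten them for clusters with a stricter RPO.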