Observability Stack

Loki · Grafana · Tempo · Mimir · Prometheus

THE LGTM STACK + PROMETHEUS

The Three Pillars of Observability

📊

Metrics

Numerical measurements over time.
CPU, memory, request counts, latency percentiles.

→ Prometheus & Mimir

📄

Logs

Time-stamped text events emitted by services.
Errors, warnings, debug output.

→ Loki

🔗

Traces

End-to-end journey of a request across services.
Spans, latency, dependency map.

→ Tempo

Grafana is the unified visualization layer that queries all three pillars from a single UI.

Components At a Glance

📈

Grafana

Visualization & Exploration

The frontend UI for everything. Dashboards, alerting, and unified querying across metrics, logs, and traces. Acts as the single pane of glass — it does not store data itself.

🔥

Prometheus

Metrics Collection & Alerting

Scrapes metrics from your services via HTTP endpoints. Stores them locally in a time-series database (TSDB). Evaluates alerting rules and forwards alerts to Alertmanager. Short-term storage by default.

☁️

Mimir

Scalable Metrics Storage

Long-term, horizontally scalable metrics backend. 100% Prometheus-compatible remote write/read API. Replaces Prometheus local storage for large-scale or multi-tenant deployments. Think of it as "Prometheus as a service".

📋

Loki

Log Aggregation

Stores and indexes logs by labels (like Prometheus does for metrics) rather than full-text indexing. Cheap and efficient. Queried using LogQL. Logs are shipped by agents (Promtail, Alloy, Fluentd).

🌐

Tempo

Distributed Tracing

Stores distributed traces (spans). Compatible with OpenTelemetry, Jaeger, and Zipkin. Integrates tightly with Loki and Prometheus — click on a log line or metric spike and jump to the correlated trace.

Architecture Diagram

Legend: data flow (push/scrape) · query/read

Sources
  • Your Applications / Services: emit metrics endpoints · write logs · emit traces
  • Log Collectors: Promtail · Alloy · Fluentd
  • OTel Collector: OpenTelemetry SDK / Agent

Backends
  • 🔥 Prometheus: scrapes metrics, evaluates alerts
  • ☁️ Mimir: long-term metrics storage (scalable)
  • 📋 Loki: log aggregation, label-based index
  • 🌐 Tempo: distributed traces (OTLP · Jaeger · Zipkin)
  • Object Storage: S3 / GCS / Azure Blob
  • Alertmanager: Slack · Email · PagerDuty
  • 📈 Grafana: dashboards · alerting · Explore (single pane of glass, no data stored here)

Flows
  • Applications → Prometheus (scrape)
  • Applications → Log Collectors → Loki (push logs)
  • Applications → OTel Collector → Tempo (push traces)
  • Prometheus → Mimir (remote write)
  • Prometheus → Alertmanager (alerts)
  • Mimir, Loki, Tempo → Object Storage (long-term store)
  • Grafana → Prometheus/Mimir, Loki, Tempo (PromQL · LogQL · TraceQL queries, correlate)

Data Collectors (How Data Gets In)

🔥 Metrics → Prometheus

  • Services expose /metrics endpoint (Prometheus format)
  • Prometheus pulls (scrapes) on a set interval
  • Node Exporter, cAdvisor for infra metrics
  • Client libs: Go, Python, Java, Node.js
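As a rough illustration of what a scrape target returns, the sketch below renders one counter in the Prometheus text exposition format. The function name, metric name, and label values are made up for the example, and label-value escaping is left out; in practice a client library does this for you.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Label values are NOT escaped here (illustration only).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric: what GET /metrics might return for one counter.
body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    {(("method", "GET"), ("status", "200")): 42},
)
print(body)
```

Prometheus fetches a text body like this from every target on each scrape interval.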

📋 Logs → Loki

  • Promtail — tails log files on each host
  • Grafana Alloy — next-gen unified collector
  • Fluentd / Fluent Bit — for existing pipelines
  • Docker / Kubernetes log drivers

🌐 Traces → Tempo

  • OpenTelemetry SDK — instrument your code
  • OTel Collector — receives, processes, exports
  • Jaeger client libraries (legacy)
  • Zipkin (legacy); Tempo also ingests OTLP natively

☁️ Metrics → Mimir

  • Prometheus sends via remote_write
  • OTel Collector can push directly to Mimir
  • Grafana Alloy can write metrics directly
  • Mimir handles multi-tenancy & long-term retention

End-to-End Data Flow

1. Application Instrumentation

Your service exposes a /metrics endpoint, writes structured logs to stdout/file, and uses an OTel SDK to emit trace spans for each incoming request and downstream call.

2. Prometheus Scrapes Metrics

Every 15–60 seconds, Prometheus pulls metrics from all configured targets. It stores them locally (TSDB) for short-term querying and evaluates alerting rules. When an alert fires, it routes to Alertmanager.

3. Prometheus Remote-Writes to Mimir

Prometheus is configured with remote_write to forward all scraped metrics to Mimir in real time. Mimir stores them cost-effectively in object storage (S3/GCS) and serves multi-tenant PromQL queries at massive scale.

4. Promtail / Alloy Ships Logs to Loki

A log collector agent (Promtail or Alloy) runs alongside your services, tails log files, attaches Kubernetes/pod labels, and pushes log streams to Loki. Loki indexes only the labels — not the full text — making storage cheap.

5. OTel Collector Sends Traces to Tempo

The OTel Collector receives spans from your services via OTLP (gRPC/HTTP), batches them, and forwards to Tempo. Tempo stores traces in object storage indexed by Trace ID, ready for TraceQL queries.

6. Grafana Queries Everything

Grafana connects to Prometheus/Mimir (PromQL), Loki (LogQL), and Tempo (TraceQL) as data sources. One dashboard can show a metric spike, the correlated log lines, and the distributed trace for the failing request — all side by side.

7. Cross-Signal Correlation (The Magic)

Loki can link log lines to traces using the Trace ID embedded in logs. Tempo can derive metrics from traces. Grafana Explore lets you jump between signals: spot an anomaly in a metric → drill into logs → open the trace. No context switching.

Quick Reference Comparison

Tool | Signal Type | Data Ingestion | Query Language | Storage Backend | Key Role
Prometheus | Metrics | Pull (scrape) | PromQL | Local TSDB (short-term) | Scraper, local metrics store, alerting engine
Mimir | Metrics | Push (remote_write) | PromQL | Object storage (S3/GCS) | Long-term scalable metrics storage, multi-tenant
Loki | Logs | Push (agents) | LogQL | Object storage (S3/GCS) | Log aggregation, label-based indexing
Tempo | Traces | Push (OTLP/Jaeger) | TraceQL | Object storage (S3/GCS) | Distributed trace storage and search
Grafana | All signals | Query (read-only) | Per data source | None (queries backends) | Visualization, dashboards, alerting UI, explore

Prometheus vs. Mimir — When to Use Which?

🔥 Use Prometheus alone when…

  • Small to medium scale (single cluster)
  • Short retention is acceptable (<15 days)
  • No multi-tenancy needed
  • Simple setup with minimal ops overhead

☁️ Add Mimir when…

  • Long-term retention (>15 days, years)
  • Multi-tenant environment (SaaS, teams)
  • Horizontal scaling needed (billions of series)
  • High availability and global querying required

Note: Mimir does NOT replace Prometheus — Prometheus still does the scraping and alerting. Mimir replaces Prometheus's local storage by receiving data via remote_write. They work together, not as alternatives.

Cross-Signal Correlation in Grafana

📊→📄 Metrics to Logs

  • Click a metric anomaly in Grafana
  • Open Explore → switch to Loki datasource
  • Same time window, same service labels auto-applied
  • See which log lines caused the spike

📄→🔗 Logs to Traces

  • Grafana's Loki data source extracts the Trace ID from log lines (derived fields)
  • Click the Trace ID link in a log line
  • Grafana opens Tempo with that trace
  • See the full request journey across services
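The log-to-trace link depends on a pattern that pulls the Trace ID out of each log line. A minimal sketch of that extraction, assuming logfmt-style lines with a field named trace_id (the field name and the sample line are assumptions, not something the stack mandates):

```python
import re

# Hypothetical pattern: one capture group around the ID, which is the
# shape a derived-field regex would take.
TRACE_ID_PATTERN = re.compile(r"trace_id=(\w+)")


def extract_trace_id(log_line):
    """Return the trace ID embedded in a log line, or None if absent."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


line = 'level=error msg="upstream timeout" trace_id=4bf92f3577b34da6a3ce929d0e0e4736'
print(extract_trace_id(line))
```

Lines without the field yield None, so they simply get no trace link.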

🔗→📊 Traces to Metrics (RED)

  • Tempo generates Rate, Errors, Duration metrics
  • Service graph shows dependency topology
  • Exemplars link specific metric data points to traces
  • Identify which service causes latency

Signal Volume at High Throughput

Traffic levels: 4k and 8k requests / sec

Estimated signal volume generated per second

📄 Log lines (avg 2 lines/req): 8k – 16k / s
🔗 Trace spans (100% sample, 5 services): 20k – 40k / s
📊 Metric scrapes (cardinality-dependent): high cardinality risk
☁️ Remote writes to Mimir: proportional to series count

Key insight: Observability signal volume is not 1:1 with TPS. A single request can produce 2–5 log lines, 5–20 spans across services, and updates to dozens of metric histograms. At 8k TPS this compounds fast.
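The estimates above are plain multiplication. A small sketch that reproduces them, assuming 2 log lines and 5 spans per request as in the table (the function and parameter names are illustrative):

```python
def signal_volume(tps, log_lines_per_req=2, spans_per_req=5, sample_rate=1.0):
    """Estimate per-second log and span volume for a given request rate."""
    return {
        "log_lines_per_s": tps * log_lines_per_req,
        "spans_per_s": round(tps * spans_per_req * sample_rate),
    }


for tps in (4_000, 8_000):
    print(tps, signal_volume(tps))

# Effect of keeping only 5% of traces at 8k requests / sec.
print(signal_volume(8_000, sample_rate=0.05))
```

Raising the per-request multipliers toward the upper end of the ranges in the key insight scales the totals linearly.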

Bottleneck Analysis by Component

📋 Loki — Most Likely to Break First

Log Aggregation · Ingester memory + label cardinality
Severity: CRITICAL
Why it bottlenecks

Write path: Distributor → Ingester (in-memory) → Chunks → Object Storage. At 16k log lines/s the ingester heap pressure is severe. The real killer is label cardinality explosion — each unique label set creates a separate log stream held in memory.

# BAD: unique per request → millions of streams
{request_id="abc-123", user_id="u-456"}

# GOOD: bounded cardinality → manageable streams
{service="api", env="prod", pod="api-7f9b"}

Mitigations
  • Use structured log fields, not labels, for high-cardinality values
  • Scale ingester replicas horizontally (WAL-backed)
  • Enable Loki distributed mode (separate read/write paths)
  • Reduce log verbosity at app level — log aggregates, not per-request lines
  • Use log sampling for DEBUG-level output (keep errors 100%)
  • Tune ingester chunk target size and flush interval
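To see why a per-request label is so costly, the sketch below counts distinct label sets, since each distinct set becomes its own stream. The entries are synthetic and the helper name is made up for illustration.

```python
def count_streams(entries, label_keys):
    """Count distinct label sets; each one is a separate Loki stream."""
    return len({tuple(entry[key] for key in label_keys) for entry in entries})


# 1,000 hypothetical requests spread over 3 pods of one service.
entries = [
    {"service": "api", "env": "prod", "pod": f"api-{i % 3}", "request_id": f"req-{i}"}
    for i in range(1_000)
]

bounded = count_streams(entries, ["service", "env", "pod"])
unbounded = count_streams(entries, ["service", "env", "pod", "request_id"])
print(bounded, unbounded)  # 3 streams versus 1000 streams
```

Adding request_id as a label turns 3 streams into one stream per request, which is the growth pattern the ingester has to hold in memory.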

🌐 OTel Collector — Single Point of Congestion

Trace ingestion · CPU + network saturation at 8k TPS
Severity: CRITICAL
Why it bottlenecks

The architecture uses a single OTel Collector as the centralized trace receiver. At 8k TPS with 5 services per request = 40k spans/second funneled to one process. The collector must receive, batch, process (sample/filter), and forward — all in-memory. A single instance saturates its CPU and memory quickly.

Mitigations
  • Deploy OTel Collector as a DaemonSet (one per node) — services talk to local collector
  • Add a gateway collector layer for tail-based sampling decisions
  • Enable tail-based sampling: keep 100% errors, 1–5% successes
  • Use head-based sampling for latency-sensitive paths
  • Tune batch processor: send_batch_size, timeout, queue_size
# OTel Collector fleet pattern
Node Agent → receives spans locally (no network hop)
    ↓
Gateway Pool → tail sampling, fan-out to Tempo
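The tail-based policy in the mitigations (keep every error, keep a small share of successes) can be sketched as a decision made once all spans of a trace have arrived. This is an illustration of the idea only, not how the OTel Collector implements it; the span dictionary shape and the function name are assumptions.

```python
import hashlib


def keep_trace(trace_id, spans, success_rate=0.05):
    """Tail-sampling decision taken after the whole trace is available."""
    # Policy 1: any span flagged as an error keeps the entire trace.
    if any(span.get("error") for span in spans):
        return True
    # Policy 2: keep a deterministic fraction of successful traces by
    # hashing the trace ID into the range [0, 2**64).
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < success_rate * 2**64


print(keep_trace("trace-a", [{"name": "GET /cart", "error": True}]))
ok_traces = [f"trace-{i}" for i in range(10_000)]
kept = sum(keep_trace(t, [{"name": "GET /cart", "error": False}]) for t in ok_traces)
print(kept)  # roughly 5% of 10,000
```

Hashing the trace ID rather than drawing a random number means every gateway replica reaches the same decision for the same trace.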

🔥 Prometheus — Cardinality at Scale

Local TSDB · High series count → memory exhaustion
Severity: HIGH
Why it bottlenecks

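Each unique label combination is one time series held in memory, so the series count is driven by label breadth rather than by TPS itself. For a single counter:
100 URL paths × 5 HTTP methods × 10 status codes = 5,000 series per instance.
With 50 service replicas = 250k active series for that one metric; a histogram multiplies this again by its bucket count.
Prometheus begins struggling past ~5–10M series. Every scrape also appends to the in-memory TSDB head, so ingestion cost grows with the series count.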
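The series arithmetic is easy to rerun for your own label sets. A small sketch, where the label names and the 12-bucket histogram are assumptions chosen for illustration:

```python
from math import prod


def active_series(label_cardinalities, replicas, series_per_combination=1):
    """Estimate active series for one metric across all replicas."""
    combinations = prod(label_cardinalities.values())
    return combinations * series_per_combination * replicas


labels = {"path": 100, "method": 5, "status": 10}

# One counter: a single series per label combination.
counter_series = active_series(labels, replicas=50)

# One histogram: assumed 12 buckets (including +Inf) plus _sum and _count.
histogram_series = active_series(labels, replicas=50, series_per_combination=12 + 2)

print(counter_series, histogram_series)  # 250000 3500000
```

A single histogram with these labels already lands in the millions of series, which is why trimming buckets and dropping labels appear in the mitigations.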

Mitigations
  • Drop high-cardinality labels at scrape time via metric_relabel_configs
  • Create recording rules to pre-aggregate before storing
  • Shard Prometheus scrape targets across multiple instances
  • Reduce histogram bucket count on http_request_duration_seconds
  • Consider Grafana Alloy to push directly to Mimir, bypassing Prometheus scrape

🌐 Tempo — WAL & Object Storage Write Throughput

Distributed Traces · Write-ahead log I/O at 100% sampling
Severity: MEDIUM
Why it bottlenecks

Tempo writes all incoming spans to a WAL on disk before flushing blocks to object storage. At 8k TPS × 5 spans = 40k spans/s, the WAL write throughput and subsequent S3/GCS flush become I/O-bound. Object storage also has per-PUT rate limits that can throttle Tempo's block uploads.

Mitigations
  • Tail-based sampling (reduce 8k TPS to 80–400 traces/s stored)
  • Use SSDs for WAL disk — not network-attached volumes
  • Scale Tempo distributors and ingesters horizontally
  • Tune max_block_duration and flush_all_on_shutdown
  • Use S3 multipart upload tuning for large block flushes

📡 Log Collectors (Promtail / Alloy) — Disk I/O & Network

Agent-side · File tailing throughput and back-pressure
Severity: MEDIUM
Why it bottlenecks

At 16k log lines/s, agents must read from log files (disk I/O), parse and label them, then push to Loki over the network. If Loki's ingester is slow (back-pressure), the agent's internal queue fills and it either drops lines or stalls the tailing loop — causing cascading delays.

Mitigations
  • Run collectors as DaemonSet — each node tails only its own logs
  • Tune Promtail/Alloy batch size and queue capacity
  • Use Grafana Alloy over Promtail — better back-pressure handling
  • Enable log compression before shipping (reduces network I/O ~70%)
  • Pre-filter noisy log sources at agent level

Bottleneck Priority Order

1. Loki ingester (cardinality + memory): hits first at ~2–4k TPS
2. OTel Collector (single instance CPU): hits at 4–6k TPS (100% sampling)
3. Prometheus TSDB (high cardinality): depends on label breadth
4. Tempo WAL (disk I/O): only critical without sampling
5. Log Collectors (disk + network): manageable with DaemonSet
6. Mimir (designed for this scale): rarely the bottleneck

Recommended Actions by Scale

<1k TPS: Baseline (default config is fine)

  • Single Prometheus + single Loki ingester
  • Single OTel Collector + Promtail DaemonSet
  • 100% trace sampling acceptable

1k–4k TPS: Scaling begins (first optimizations needed)

  • Audit and fix label cardinality in Loki streams
  • Enable Prometheus recording rules for heavy queries
  • Switch to OTel Collector DaemonSet
  • Introduce head-based trace sampling (10–20%)
  • Enable log compression in Promtail/Alloy

4k–8k TPS: High traffic (horizontal scaling required)

  • Loki distributed mode (separate distributor / ingester / querier)
  • Prometheus sharding across multiple instances; Mimir as unified query backend
  • OTel Collector gateway with tail-based sampling (keep errors 100%, success <5%)
  • Tempo distributor + ingester scaled horizontally; SSD WAL mandatory
  • Consider replacing Promtail with Grafana Alloy for better back-pressure

>8k TPS: Extreme scale (architecture shift)

  • Replace Prometheus scrape with Grafana Alloy push directly to Mimir (eliminates pull bottleneck)
  • Loki with dedicated object storage writers; consider Kafka ingest buffer in front of Loki
  • Adaptive trace sampling — dynamically adjust rate based on error budget
  • Grafana Enterprise or Grafana Cloud for managed scaling of the LGTM backend

Key Architecture Shift at High Scale: Grafana Alloy

Grafana Alloy (successor to the Grafana Agent) is the key simplification at high scale. It collapses Promtail + OTel Collector into a single agent per node, eliminating multiple fan-in bottlenecks.

Before (default stack)
Promtail → Loki
OTel Agent → OTel Collector → Tempo
Prometheus scrape → remote_write → Mimir
3 separate agent processes per node

After (Alloy at high scale)
Grafana Alloy (DaemonSet)
├─ scrapes metrics → Mimir (push)
├─ tails logs → Loki (push)
└─ receives OTLP → Tempo (push)
1 agent process per node

Alloy also supports adaptive metric collection — it can reduce scrape frequency for stable metrics and increase it for volatile ones, lowering overall TSDB pressure proportional to actual signal change.

Step 0 — Add Helm Repositories

helm repo add grafana https://grafana.github.io/helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

All five components live in two repos. Run helm search repo grafana to see available charts and versions.

Deployment Modes (Loki, Mimir, Tempo)

DEV / SMALL

Monolithic

All components run as a single process. Simplest setup. Good up to ~20 GB logs/day. One Deployment, one PVC. Minimal resource overhead.

helm install loki grafana/loki \
  --set deploymentMode=SingleBinary \
  --set loki.commonConfig.replication_factor=1

MEDIUM

Simple Scalable Deployment (SSD)

Splits into read, write, and backend targets. Each scales independently. Up to ~1 TB logs/day. Being deprecated before Loki 4.0 — skip for new setups.

helm install loki grafana/loki \
  --set deploymentMode=SimpleScalable

PRODUCTION

Microservices / Distributed

Each component (distributor, ingester, querier…) is a separate Deployment or StatefulSet. Fine-grained scaling. Recommended by Grafana Labs for production.

helm install loki grafana/loki \
  --set deploymentMode=Distributed

Same pattern applies to Mimir and Tempo — use grafana/mimir-distributed and grafana/tempo-distributed for production deployments. They share the same microservices philosophy.

Kubernetes Resources per Component

Component | K8s Kind | Typical Replicas | CPU Request | Memory Request | Notes
Prometheus | StatefulSet | 1–2 | 500m | 2 Gi | Created via Prometheus Operator CR
Node Exporter | DaemonSet | 1 per node | 100m | 128 Mi | Included in kube-prometheus-stack
Loki Distributor | Deployment | 2+ | 250m–500m | 512 Mi–1 Gi | Receives & validates log writes
Loki Ingester | StatefulSet | 3+ | 500m–1000m | 2–4 Gi | In-memory buffer; needs PVC for WAL
Loki Querier | Deployment | 2+ | 500m | 1 Gi | Reads from object storage
Loki Compactor | StatefulSet | 1 | 200m | 512 Mi | Deduplicates & manages retention
Mimir Distributor | Deployment | 2–12 | 2 | 4 Gi | No CPU limit by design (avoid throttle)
Mimir Ingester | StatefulSet | 3–6+ | 2 | 4 Gi (limit 12 Gi) | Needs SSD PVC; core scaling unit
Mimir Store-Gateway | StatefulSet | 3+ | 2 | 4 Gi | Serves historical data from object store
Tempo Distributor | Deployment | 2+ | 200m | 256 Mi | Receives spans from collectors
Tempo Ingester | StatefulSet | 3+ | 500m | 512 Mi | WAL on disk; SSD preferred
Grafana | Deployment | 1–2 | 250m | 512 Mi | Stateless UI; dashboards in ConfigMaps
Grafana Alloy | DaemonSet | 1 per node | 200m | 256 Mi | Replaces Promtail + OTel Agent

Object Storage Setup

🧪 Local Dev — MinIO

  • Deploy MinIO as a StatefulSet in the cluster
  • S3 API on port 9000, Console on 9001
  • Create buckets: loki, mimir, tempo
  • Store credentials in a Kubernetes Secret
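  • Use helm install minio minio/minio (add the minio Helm repo first; it is not one of the two repos from Step 0) or the Grafana bundled option

# Dev-only credentials; the quotes keep the shell from globbing the [n] indexes
helm install minio minio/minio \
  --set rootUser=admin \
  --set rootPassword=password123 \
  --set 'buckets[0].name=loki' \
  --set 'buckets[1].name=mimir' \
  --set 'buckets[2].name=tempo'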

☁️ Production — Cloud Object Storage

  • AWS S3: Use IAM role via IRSA (pod identity) — no static keys
  • GCS: Workload Identity + service account annotation
  • Azure Blob: Managed Identity or connection string in Secret
  • One bucket per component is the recommended pattern
  • Enable versioning and lifecycle policies for retention management

# Example Loki values.yaml for S3
loki:
  storage:
    type: s3
    s3:
      region: us-east-1
      bucketnames: my-loki-bucket
      s3ForcePathStyle: false

Ports Reference

🔥 Prometheus

HTTP / PromQL API: 9090
Node Exporter: 9100
Alertmanager: 9093

📋 Loki

HTTP API / Push: 3100
gRPC: 9095
Memberlist (gossip): 7946

☁️ Mimir

HTTP API: 8080
gRPC: 9095
Memberlist (gossip): 7946

🌐 Tempo

HTTP API: 3200
OTLP gRPC: 4317
OTLP HTTP: 4318
Jaeger gRPC: 14250
Zipkin HTTP: 9411
Memberlist (gossip): 7946

📈 Grafana

HTTP UI: 3000

🗄️ MinIO

S3 API: 9000
Console UI: 9001

Grafana Alloy — The One Agent to Rule Them All

Deploy Grafana Alloy as a DaemonSet so one agent pod runs on every node. It replaces Promtail, OTel Agent, and Prometheus node-level scraping in one binary.

What Alloy collects on each node
alloy (DaemonSet on every node)
├─ /var/log/pods/** → Loki :3100
├─ cAdvisor metrics → Mimir :8080
├─ node /metrics → Mimir :8080
└─ OTLP spans recv → Tempo :4317

Install via Helm
helm install alloy grafana/alloy \
  -f alloy-values.yaml

# alloy-values.yaml
controller:
  type: daemonset
alloy:
  configMap:
    content: |
      loki.write "default" {
        endpoint {
          // loki.write expects the full push path, not just host:port
          url = "http://loki:3100/loki/api/v1/push"
        }
      }

HostPath mounts (/var/log, /var/lib/docker) give Alloy direct access to container logs without any log driver changes. Alloy also handles back-pressure from Loki — if the ingester is slow, Alloy queues and retries instead of dropping.

Recommended Install Order

1. Namespace + Object Storage

Create the monitoring namespace. Deploy MinIO (dev) or configure cloud bucket credentials as Secrets. All components need storage first.

2. kube-prometheus-stack (Prometheus + Grafana + Alertmanager)

Installs Prometheus Operator, Prometheus, Alertmanager, and Grafana in one chart. Includes Node Exporter DaemonSet and kube-state-metrics. This is your metrics foundation.

helm install kube-prom prometheus-community/kube-prometheus-stack \
  -n monitoring \
  --set 'prometheus.prometheusSpec.remoteWrite[0].url=http://mimir:8080/api/v1/push'

3. Mimir

Deploy distributed Mimir. Point Prometheus remote_write at Mimir's distributor service. Mimir becomes the long-term metrics backend.

helm install mimir grafana/mimir-distributed \
  -n monitoring -f mimir-values.yaml

4. Loki

Deploy Loki in distributed mode. Loki ingesters form their own hash ring via memberlist on port 7946.

helm install loki grafana/loki \
  -n monitoring -f loki-values.yaml

5. Tempo

Deploy Tempo distributed. Tempo listens on OTLP gRPC :4317 and HTTP :4318. Stores traces to object storage.

helm install tempo grafana/tempo-distributed \
  -n monitoring -f tempo-values.yaml

6. Grafana Alloy (DaemonSet)

Deploy Alloy last so it can reach Loki, Mimir, and Tempo endpoints. Configure it to tail pod logs, scrape node metrics, and forward OTLP spans.

helm install alloy grafana/alloy \
  -n monitoring --set controller.type=daemonset

7. Configure Grafana Data Sources

Add Prometheus/Mimir, Loki, and Tempo as data sources in Grafana. Use provisioning ConfigMaps to automate this in GitOps workflows.

# datasources.yaml (provisioned via ConfigMap)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki-gateway:3100
  - name: Mimir
    type: prometheus
    url: http://mimir-nginx:8080/prometheus
  - name: Tempo
    type: tempo
    url: http://tempo-query-frontend:3200

What is the Hash Ring?

Loki, Mimir, and Tempo are all distributed systems. When you have 6 ingester pods, how does a distributor know which ingester should receive a given log stream or metric time series? The answer is the consistent hash ring.

The ring is a circular 32-bit integer space (0 → 4,294,967,295). Each ingester instance claims a set of tokens (random points) on this ring. When data arrives, its labels/tenant/ID are hashed to a position on the ring, and the data is routed to the ingester that owns the nearest token clockwise.

[Diagram: Hash Ring, 0 → 4,294,967,295, read clockwise. RF = 3 | Quorum = 2]
  • Ingester A: tok 1200, tok 3800
  • Ingester B: tok 2200, tok 2900
  • Ingester C: tok 3400, tok 500
  • hash(stream) = 1050 → routes to A

Each ingester owns multiple tokens (shown as colored dots). Data hashes to a position; the nearest clockwise token wins. With RF=3, the next 2 ingesters also receive a copy.
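The lookup can be sketched with a sorted token list and a binary search. The tokens below are the ones from the diagram; the class and method names are hypothetical, and real rings add details such as zone awareness that this sketch leaves out.

```python
import bisect


class HashRing:
    """Toy consistent hash ring: token -> ingester, searched clockwise."""

    def __init__(self, tokens_by_ingester):
        # Flatten to (token, ingester) pairs sorted by token position.
        self._ring = sorted(
            (token, name)
            for name, tokens in tokens_by_ingester.items()
            for token in tokens
        )
        self._tokens = [token for token, _ in self._ring]

    def replicas(self, key_hash, rf=3):
        """Return up to `rf` distinct ingesters, walking clockwise."""
        start = bisect.bisect_left(self._tokens, key_hash)
        owners = []
        for step in range(len(self._ring)):
            # The modulo wraps around past the highest token.
            _, name = self._ring[(start + step) % len(self._ring)]
            if name not in owners:
                owners.append(name)
            if len(owners) == rf:
                break
        return owners


ring = HashRing({"A": [1200, 3800], "B": [2200, 2900], "C": [3400, 500]})
print(ring.replicas(1050, rf=1))  # ['A']: token 1200 is next clockwise
print(ring.replicas(1050, rf=3))  # ['A', 'B', 'C']
print(ring.replicas(4000, rf=1))  # ['C']: wraps around to token 500
```

Walking past B's second token (2900) without counting B twice is what keeps the three replicas on three different ingesters.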

Write Path Through the Ring

1. Distributor receives write

A log line arrives at the Loki distributor (or a metric sample arrives at the Mimir distributor). The distributor validates limits and computes the hash of the stream's label set (or metric labels).

2. Ring lookup → N ingesters

The hash maps to a position on the ring. The distributor finds the next N ingesters clockwise (N = replication factor, default 3). These are the target ingesters for this write.

3. Parallel write to all N ingesters

The distributor sends the write in parallel to all 3 ingesters simultaneously (Dynamo-style). It does NOT wait for all 3 — just for quorum.

4. Quorum achieved → success

With RF=3, quorum = floor(3/2)+1 = 2. As soon as 2 ingesters confirm the write, the distributor returns success. The 3rd ingester writes in the background. One ingester can be down without affecting writes.

Rings in Each Component

📋 Loki Rings

Ingester Ring — routes log stream writes from distributors to the correct ingesters based on stream label hash.

Distributor Ring: lets distributors discover how many peers are running, so per-tenant rate limits can be split across them.

Compactor Ring — shards compaction jobs so only one compactor owns a given chunk.

UI: /distributor/ring, /ingester/ring

☁️ Mimir Rings

Ingester Ring — routes metric time series writes. Hash of metric labels determines target ingesters.

Store-Gateway Ring — shards which historical blocks each store-gateway instance serves from object storage.

Compactor Ring — coordinates block compaction to avoid race conditions.

Ruler Ring — distributes alert rule evaluation across ruler replicas.

-ingester.ring.* | -store-gateway.sharding-ring.*

🌐 Tempo Rings

Ingester Ring — routes trace spans by Trace ID hash. All spans of one trace land on the same ingester.

Distributor Ring: distributor coordination, used to share global rate limits across distributors.

Compactor Ring — shard block compaction work.

Metrics-Generator Ring — optional; shards span-to-metrics derivation.

UI: /ingester/ring, /compactor/ring

Memberlist — The Gossip Protocol Behind the Ring

The ring state (who owns which tokens) must be shared across all instances. Loki, Mimir, and Tempo use memberlist, HashiCorp's gossip library, to propagate this information without any central coordinator. Instances gossip with random peers, so information spreads exponentially fast.

Gossip messages
JOIN   → "I exist, here are my tokens"
LEAVE  → "I'm shutting down gracefully"
PING   → heartbeat every few seconds
UPDATE → "my token set changed"

# Propagation:
Differential gossip → sends only recent diffs to random subset of peers
Pull-push sync → full state exchange with one random peer ensures convergence

Kubernetes config
# All components use port 7946
# They discover peers via:
#   1. DNS SRV lookups (Headless Service → pod IPs)
#   2. Pod label selector memberlist.join = pod:// ...
#   3. Static IPs (simple setups)

# In values.yaml:
loki:
  memberlist:
    service:
      publishNotReadyAddresses: true

Why port 7946? All three components (Loki, Mimir, Tempo) use port 7946/TCP for memberlist by default. In Kubernetes, you need a headless Service exposing this port so pods can discover each other for ring formation.

What Happens When a Pod Joins or Leaves?

+ New Ingester Joins (scale-up)

  • Pod starts, generates random token values
  • Registers tokens in the ring via memberlist JOIN
  • State: JOINING → ACTIVE
  • Distributors update their ring copy and begin routing some writes to the new pod
  • Only ~1/N fraction of data rebalances (consistent hashing advantage)
  • WAL segments stream to new owner for in-flight data
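The "~1/N rebalances" claim can be checked with a small simulation: build a ring with three ingesters, add a fourth, and count how many keys change owner. Tokens are derived from the ingester name here so the run is reproducible (real ingesters pick random tokens), and the names, token count, and key set are all assumptions made for the example.

```python
import bisect
import hashlib


def hash32(text):
    """Map a string to a position in the 32-bit ring space."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")


def build_ring(ingesters, tokens_per_ingester=128):
    """Return sorted (token, ingester) pairs for the given ingesters."""
    return sorted(
        (hash32(f"{name}-{i}"), name)
        for name in ingesters
        for i in range(tokens_per_ingester)
    )


def make_lookup(ring):
    """Build a function that maps a key to its owning ingester."""
    tokens = [token for token, _ in ring]

    def lookup(key):
        # First token clockwise from the key's hash, wrapping at the end.
        index = bisect.bisect_left(tokens, hash32(key)) % len(ring)
        return ring[index][1]

    return lookup


before = make_lookup(build_ring(["ingester-0", "ingester-1", "ingester-2"]))
after = make_lookup(
    build_ring(["ingester-0", "ingester-1", "ingester-2", "ingester-3"])
)

keys = [f"stream-{i}" for i in range(10_000)]
moved = [key for key in keys if before(key) != after(key)]
moved_fraction = len(moved) / len(keys)
print(f"{moved_fraction:.1%} of keys changed owner")  # expect roughly 1/4
```

Every key that moves lands on the new ingester; ownership among the three existing ingesters is untouched, which is the advantage over a plain hash-modulo scheme.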

Ingester Leaves (scale-down or crash)

  • Graceful: pod enters LEAVING state, flushes chunks to object store, then exits
  • Crash: other pods stop receiving heartbeats; after heartbeat_timeout the instance is marked UNHEALTHY
  • Distributors reroute writes to the remaining ring members
  • Replication factor ensures data is not lost (quorum copies exist)
  • Compactor eventually reconciles any inconsistencies from object storage

Ring Instance States

JOINING

Pod is starting up. Tokens registered but not yet ready to serve reads. Other members are aware it exists.

ACTIVE

Fully operational. Receives writes and serves reads. Normal healthy state.

LEAVING

Pod is shutting down gracefully. Flushing in-memory data to object storage before removing tokens from ring.

UNHEALTHY

Pod stopped heartbeating. Marked dead after heartbeat_timeout. Traffic rerouted to healthy replicas.

Monitor ring health by visiting the component's HTTP UI at /ingester/ring, /distributor/ring, or /compactor/ring. You'll see each instance, its state, last heartbeat, and the tokens it owns. In Kubernetes: kubectl port-forward svc/loki-ingester 3100 -n monitoring then browse to localhost:3100/ingester/ring.

Replication Factor & Quorum Math

Formula
Quorum = floor(RF / 2) + 1

RF = 1 → Quorum = 1 (no fault tolerance)
RF = 2 → Quorum = 2 (both must succeed)
RF = 3 → Quorum = 2 (1 failure tolerated) ✓
RF = 5 → Quorum = 3 (2 failures tolerated)

Config
# Loki values.yaml
loki:
  commonConfig:
    replication_factor: 3

# Mimir values.yaml
mimir:
  ingester:
    ring:
      replication_factor: 3

RF=3 is the production default — it means you can lose 1 ingester pod with zero data loss and zero write interruption. Your StatefulSet replicas must be ≥ RF. For RF=3, run at least 3 ingester replicas spread across availability zones.
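The quorum table follows directly from the formula. A short sketch that recomputes it (the function names are illustrative):

```python
def quorum(rf):
    """Minimum successful ingester writes needed: floor(RF / 2) + 1."""
    return rf // 2 + 1


def tolerated_failures(rf):
    """Ingesters that can be down while writes still reach quorum."""
    return rf - quorum(rf)


def write_succeeds(acks, rf):
    """True once enough ingesters have confirmed the write."""
    return acks >= quorum(rf)


for rf in (1, 2, 3, 5):
    print(f"RF={rf} quorum={quorum(rf)} tolerated={tolerated_failures(rf)}")

print(write_succeeds(acks=2, rf=3))  # True: one ingester may be down
```

RF=2 buys no extra tolerance over RF=1 because both copies must succeed, which is why the usual step up is straight to RF=3.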

KV Store Backends for the Ring

RECOMMENDED
Memberlist (gossip)

  • No external dependencies — self-contained
  • Decentralized peer-to-peer discovery
  • Eventually consistent (converges quickly)
  • Default in all Helm charts
  • Works well for single-cluster deployments
-ring.store=memberlist

ALTERNATIVE
etcd

  • External etcd cluster required
  • Strong consistency (linearizable reads)
  • Good for multi-cluster or federated setups
  • More operational overhead
  • Can become a bottleneck at very high churn
-ring.store=etcd -etcd.endpoints=etcd:2379

LEGACY
Consul

  • Supported for backwards compatibility
  • External Consul cluster required
  • Grafana Labs migrated away from Consul → memberlist in 2022
  • Use memberlist for new deployments
-ring.store=consul -consul.hostname=consul:8500

TL;DR: Use memberlist. Grafana Labs themselves migrated from Consul to memberlist in production. It requires no extra infrastructure and handles pod churn in Kubernetes naturally via DNS-based peer discovery.