The Three Pillars of Observability
Metrics
Numerical measurements over time.
CPU, memory, request counts, latency percentiles.
→ Prometheus & Mimir
Logs
Time-stamped text events emitted by services.
Errors, warnings, debug output.
→ Loki
Traces
End-to-end journey of a request across services.
Spans, latency, dependency map.
→ Tempo
Grafana is the unified visualization layer that queries all three pillars from a single UI.
Components At a Glance
Grafana
The frontend UI for everything. Dashboards, alerting, and unified querying across metrics, logs, and traces. Acts as the single pane of glass — it does not store data itself.
Prometheus
Scrapes metrics from your services via HTTP endpoints. Stores them locally in a time-series database (TSDB). Evaluates alerting rules and forwards alerts to Alertmanager. Short-term storage by default.
Mimir
Long-term, horizontally scalable metrics backend. 100% Prometheus-compatible remote write/read API. Replaces Prometheus local storage for large-scale or multi-tenant deployments. Think of it as "Prometheus as a service".
Loki
Stores and indexes logs by labels (like Prometheus does for metrics) rather than full-text indexing. Cheap and efficient. Queried using LogQL. Logs are shipped by agents (Promtail, Alloy, Fluentd).
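To make the "index labels, not text" idea concrete, here is a toy in-memory sketch. It is not how Loki is implemented; the class name ToyLogStore and its methods are invented for illustration. The point it tries to show is that only the label set is used to locate streams, while the line content is filtered by brute force at query time.

```python
from collections import defaultdict


class ToyLogStore:
    """Hypothetical label-indexed log store (illustration only, not Loki's design)."""

    def __init__(self):
        # The "index": one entry per unique label set, pointing at its lines.
        self._streams = defaultdict(list)

    def push(self, labels, line):
        # Lines with identical labels land in the same stream.
        self._streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        # Step 1: use the label index to pick matching streams.
        # Step 2: scan only those streams for the substring (no text index).
        wanted = set(selector.items())
        hits = []
        for stream_labels, lines in self._streams.items():
            if wanted <= stream_labels:
                hits.extend(line for line in lines if contains in line)
        return hits


store = ToyLogStore()
store.push({"app": "api", "env": "prod"}, "GET /health 200")
store.push({"app": "api", "env": "prod"}, "error: upstream timeout")
store.push({"app": "db", "env": "prod"}, "error: disk full")
api_errors = store.query({"app": "api"}, contains="error")
```

Because the index grows with the number of distinct label sets rather than with log volume, keeping labels low-cardinality is what keeps this approach cheap.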
Tempo
Stores distributed traces (spans). Compatible with OpenTelemetry, Jaeger, and Zipkin. Integrates tightly with Loki and Prometheus — click on a log line or metric spike and jump to the correlated trace.
Architecture Diagram
Data Collectors (How Data Gets In)
🔥 Metrics → Prometheus
- Services expose a /metrics endpoint (Prometheus format)
- Prometheus pulls (scrapes) on a set interval
- Node Exporter, cAdvisor for infra metrics
- Client libs: Go, Python, Java, Node.js
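As a rough illustration of what a scrape target returns, the sketch below renders one counter in the Prometheus text exposition format using only the Python standard library. The metric name http_requests_total, its labels, and the helper render_counter are made up for the example; a real service would normally rely on an official client library instead of formatting by hand.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to a value.
    Label-value escaping is deliberately left out of this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counts keyed by label set.
requests = {
    (("method", "GET"), ("status", "200")): 1027,
    (("method", "POST"), ("status", "500")): 3,
}
body = render_counter("http_requests_total", "Total HTTP requests.", requests)
print(body, end="")
```

Serving this text from /metrics over HTTP is all a target has to do; Prometheus handles timestamps and storage on its side.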
📋 Logs → Loki
- Promtail — tails log files on each host
- Grafana Alloy — next-gen unified collector
- Fluentd / Fluent Bit — for existing pipelines
- Docker / Kubernetes log drivers
🌐 Traces → Tempo
- OpenTelemetry SDK — instrument your code
- OTel Collector — receives, processes, exports
- Jaeger client libraries (legacy)
- Zipkin (legacy); Tempo also supports OTLP natively
☁️ Metrics → Mimir
- Prometheus sends via remote_write
- OTel Collector can push directly to Mimir
- Grafana Alloy can write metrics directly
- Mimir handles multi-tenancy & long-term retention
End-to-End Data Flow
Application Instrumentation
Your service exposes a /metrics endpoint, writes structured logs to stdout/file, and uses an OTel SDK to emit trace spans for each incoming request and downstream call.
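A minimal sketch of the structured-logging half of this step, using only the standard library. The field names (ts, level, service, trace_id, msg), the logger name checkout, and the sample trace ID are assumptions chosen for illustration; in a real service the trace ID would come from the active OTel span rather than being passed by hand.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line (field names are illustrative)."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "msg": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(stream=sys.stdout):
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info(
    "payment declined",
    extra={"service": "checkout", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
)
```

Putting the trace ID in every log line is what later allows a log entry to be linked to its trace.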
Prometheus Scrapes Metrics
Every 15–60 seconds, Prometheus pulls metrics from all configured targets. It stores them locally (TSDB) for short-term querying and evaluates alerting rules. When an alert fires, Prometheus sends it to Alertmanager, which handles grouping, routing, and notification.
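To show roughly what the scraper does with a response body, the sketch below parses a small subset of the text exposition format into (name, labels, value) tuples. It is a simplification written for this example: the function parse_scrape is hypothetical, and optional timestamps, escaped label values, and histogram semantics are all ignored.

```python
import re

# Matches: metric_name{optional="labels"} value
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$"
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_scrape(text):
    """Parse simple exposition-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # unsupported line shape (for example, one with a timestamp)
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


scraped = parse_scrape(
    "# HELP http_requests_total Total HTTP requests.\n"
    "# TYPE http_requests_total counter\n"
    'http_requests_total{method="GET",status="200"} 1027\n'
    "up 1\n"
)
```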
Prometheus Remote-Writes to Mimir
Prometheus is configured with remote_write to forward scraped metrics to Mimir continuously as they are ingested. Mimir stores them in object storage (S3/GCS), which keeps long-term retention inexpensive, and serves multi-tenant PromQL queries across the full retention window.
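On the read side, multi-tenancy shows up as an HTTP header. The sketch below builds (but does not send) a PromQL instant-query request for one tenant. It assumes Mimir's default /prometheus API prefix and the X-Scope-OrgID tenant header; the host name, tenant ID, and the helper build_mimir_query are placeholders for illustration.

```python
import urllib.parse
import urllib.request


def build_mimir_query(base_url, tenant_id, promql):
    """Build a Prometheus-style instant query request scoped to one tenant."""
    params = urllib.parse.urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/prometheus/api/v1/query?{params}"
    # The tenant is selected per request through this header.
    return urllib.request.Request(url, headers={"X-Scope-OrgID": tenant_id})


req = build_mimir_query(
    "http://mimir.example.internal:8080",  # hypothetical address
    "team-a",                              # hypothetical tenant
    "sum(rate(http_requests_total[5m]))",
)
print(req.full_url)
```

Grafana does the equivalent when a Mimir data source is configured with a custom tenant header, so dashboards for different teams can share one backend.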
Promtail / Alloy Ships Logs to Loki
A log collector agent (Promtail or Alloy) runs alongside your services, tails log files, attaches Kubernetes/pod labels, and pushes log streams to Loki. Loki indexes only the labels — not the full text — making storage cheap.
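What the agent pushes can be sketched as a JSON body for Loki's /loki/api/v1/push endpoint: a list of streams, each with a label set and [timestamp, line] pairs where the timestamp is a string of nanoseconds since the epoch. The label values and the helper build_loki_push below are illustrative assumptions; real agents also batch, retry, and compress.

```python
import json
import time


def build_loki_push(labels, lines, now_ns=None):
    """Build a JSON push body containing one stream.

    Each value is [timestamp_ns_as_string, log_line]; lines are spaced
    1 ns apart here only to keep their order unambiguous.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})


payload = build_loki_push(
    {"namespace": "shop", "pod": "checkout-7d9f", "app": "checkout"},  # hypothetical labels
    ["order 1234 accepted", "payment declined"],
)
```

Only the keys under "stream" are indexed; the log lines themselves are stored as compressed chunks.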
OTel Collector Sends Traces to Tempo
The OTel Collector receives spans from your services via OTLP (gRPC/HTTP), batches them, and forwards to Tempo. Tempo stores traces in object storage indexed by Trace ID, ready for TraceQL queries.
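The Trace ID that Tempo indexes by is created when a request first enters the system and is carried to downstream services. The sketch below imitates that with the W3C Trace Context traceparent header format (version, 16-byte trace ID, 8-byte span ID, flags). In practice an OTel SDK generates and propagates these values; the helper names here are invented for the example.

```python
import re
import secrets

# traceparent: version "00", 32 hex chars trace ID, 16 hex chars span ID, 2 hex chars flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: fresh trace ID and root span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent):
    """Continue an existing trace: keep the trace ID, mint a new span ID."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError("not a valid traceparent header")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()
outgoing = child_traceparent(incoming)  # value to send on the downstream call
```

Every span that shares the same trace ID ends up in the same trace, which is why a single ID lookup can return the whole request path.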
Grafana Queries Everything
Grafana connects to Prometheus/Mimir (PromQL), Loki (LogQL), and Tempo (TraceQL) as data sources. One dashboard can show a metric spike, the correlated log lines, and the distributed trace for the failing request — all side by side.
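For a feel of the three query languages side by side, the sketch below builds one example query per signal for the same service. The metric name http_request_duration_seconds_bucket, the service label, and the service name are assumptions; actual names depend on how your services are instrumented and labeled.

```python
def promql_p95_latency(service):
    """PromQL: 95th percentile latency from a histogram over the last 5 minutes."""
    return (
        "histogram_quantile(0.95, sum by (le) ("
        f'rate(http_request_duration_seconds_bucket{{service="{service}"}}[5m])))'
    )


def logql_errors(service):
    """LogQL: select the service's streams by label, then filter lines by substring."""
    return f'{{service="{service}"}} |= "error"'


def traceql_failed(service):
    """TraceQL: spans from the service whose status is error."""
    return f'{{ resource.service.name = "{service}" && status = error }}'


for build in (promql_p95_latency, logql_errors, traceql_failed):
    print(build("checkout"))
```

All three start from a label or attribute selector in braces, which is part of what makes moving between the data sources feel consistent.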
Cross-Signal Correlation (The Magic)
Grafana can link Loki log lines to Tempo traces using the Trace ID embedded in the logs. Tempo can derive metrics from traces. Grafana Explore lets you jump between signals: spot an anomaly in a metric → drill into logs → open the trace, without switching tools.
Quick Reference Comparison
Prometheus vs. Mimir — When to Use Which?
🔥 Use Prometheus alone when…
- Small to medium scale (single cluster)
- Short retention is acceptable (<15 days)
- No multi-tenancy needed
- Simple setup with minimal ops overhead
☁️ Add Mimir when…
- Long-term retention (>15 days, years)
- Multi-tenant environment (SaaS, teams)
- Horizontal scaling needed (on the order of a billion active series)
- High availability and global querying required
Note: Mimir does NOT replace Prometheus — Prometheus still does the scraping and alerting.
Mimir takes over long-term storage and querying by receiving data via remote_write; Prometheus keeps only a short local buffer.
They work together, not as alternatives.
Cross-Signal Correlation in Grafana
📊→📄 Metrics to Logs
- Click a metric anomaly in Grafana
- Open Explore → switch to Loki datasource
- Same time window, same service labels auto-applied
- See which log lines coincide with the spike
📄→🔗 Logs to Traces
- Grafana's Loki data source extracts the Trace ID from log lines (derived fields)
- Click the Trace ID link in a log line
- Grafana opens Tempo with that trace
- See the full request journey across services
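The extraction step in this flow is essentially a regular expression with one capture group applied to each log line. A small sketch, assuming the logs carry a 32-character hex ID under a field called trace_id in either logfmt or JSON form (both the field name and the pattern are assumptions to adapt to your log format):

```python
import re

# Accepts trace_id=<hex> (logfmt) and "trace_id": "<hex>" (JSON).
TRACE_ID_RE = re.compile(r'"?trace_id"?\s*[=:]\s*"?([0-9a-f]{32})"?')


def extract_trace_id(log_line):
    """Return the trace ID found in a log line, or None if there is none."""
    match = TRACE_ID_RE.search(log_line)
    return match.group(1) if match else None


logfmt_line = 'level=error trace_id=4bf92f3577b34da6a3ce929d0e0e4736 msg="payment declined"'
json_line = '{"level": "error", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}'
found = [extract_trace_id(logfmt_line), extract_trace_id(json_line)]
```

The captured value is what gets substituted into the link that opens the trace in Tempo.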
🔗→📊 Traces to Metrics (RED)
- Tempo generates Rate, Errors, Duration metrics
- Service graph shows dependency topology
- Exemplars link specific metric data points to traces
- Identify which service causes latency
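The RED idea can be sketched by aggregating a batch of finished spans per service. This is only an illustration of the arithmetic, not Tempo's metrics generator: the Span fields, the red_metrics helper, and the sample data are all invented for the example.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    duration_ms: float
    is_error: bool


def red_metrics(spans, window_seconds):
    """Compute Rate, Errors, Duration per service for spans seen in one window."""
    by_service = {}
    for span in spans:
        by_service.setdefault(span.service, []).append(span)

    result = {}
    for service, group in by_service.items():
        durations = [s.duration_ms for s in group]
        result[service] = {
            "rate_per_s": len(group) / window_seconds,                   # Rate
            "error_ratio": sum(s.is_error for s in group) / len(group),  # Errors
            "median_duration_ms": statistics.median(durations),          # Duration
            "max_duration_ms": max(durations),
        }
    return result


# Hypothetical spans observed during a 60-second window.
spans = [
    Span("checkout", 100.0, False),
    Span("checkout", 300.0, True),
    Span("cart", 50.0, False),
]
red = red_metrics(spans, window_seconds=60)
```

Comparing these numbers across services in a dependency chain is one way to narrow down which hop contributes most of the latency or errors.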