
Observability-Driven Development

Observability is not monitoring with better branding. Monitoring tells you when something breaks. Observability lets you understand why it broke — and why it is slow, and why that one user's experience differs from everyone else's.

Abhishek Sharma· Head of Engg @ Fordel Studios
8 min read

The shift from monitoring to observability is a shift from known-unknowns to unknown-unknowns. Monitoring checks predefined conditions: is CPU above 80%? Is the error rate above 1%? Is the response time above 500ms? Observability lets you ask arbitrary questions of your system without having anticipated those questions in advance.

This distinction matters because modern distributed systems fail in ways that cannot be predicted. A request that traverses 15 microservices, 3 databases, 2 caches, and an AI model inference endpoint has thousands of potential failure modes. You cannot write a monitoring check for each one. You need the ability to trace a single request through the entire system and understand its behavior.

···

The Three Pillars (and Why They Are Not Enough)

The traditional framing of observability centers on three pillars: logs, metrics, and traces. This framing is useful but incomplete.

Logs capture discrete events. Metrics capture aggregated measurements over time. Traces capture the journey of a single request through distributed services. The insight comes from correlating all three: a metric shows latency increased, a trace pinpoints which service introduced the latency, and a log from that service reveals the root cause.

OpenTelemetry: The Convergence

OpenTelemetry (OTel) has become the standard for instrumenting applications. It provides a vendor-neutral SDK for generating traces, metrics, and logs, with exporters for every major observability backend (Datadog, Grafana, Honeycomb, New Relic, Jaeger). The key insight behind OTel is that instrumentation should be separated from the observability backend — instrument once, export to any vendor.

For AI applications, OTel is particularly valuable because AI pipelines are inherently distributed: an API request triggers RAG retrieval, prompt construction, model inference, output parsing, and response formatting — each potentially running on different services. OTel traces capture this entire chain with timing, input/output sizes, and error states at each step.

OpenTelemetry Instrumentation Priorities
  • HTTP server and client libraries (auto-instrumentation available for most frameworks)
  • Database clients — query timing, connection pool metrics, query text for slow query analysis
  • LLM API calls — model name, token count, latency, prompt/completion token breakdown
  • Queue consumers — message processing time, batch size, lag
  • Custom business logic — spans around critical code paths with domain-specific attributes
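The span-per-operation pattern behind these priorities can be sketched without any dependencies. The snippet below is an illustrative stand-in for OpenTelemetry's `tracer.start_as_current_span`: the in-memory `SPANS` list, the attribute names, and the fake `call_llm` are assumptions for the example, not OTel API.

```python
import time
from contextlib import contextmanager

# In-memory stand-in for an OTel span exporter, purely for illustration.
SPANS = []

@contextmanager
def span(name, **attributes):
    record = {"name": name, "attributes": dict(attributes)}
    start = time.perf_counter()
    try:
        yield record["attributes"]          # callers can add attributes mid-span
        record["status"] = "ok"
    except Exception as exc:
        record["status"] = "error"
        record["attributes"]["error.type"] = type(exc).__name__
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)

# Hypothetical LLM call wrapped in a span, recording the attributes listed above.
def call_llm(prompt):
    with span("llm.completion", **{"llm.model": "gpt-4o"}) as attrs:
        completion = prompt.upper()         # stand-in for the real API call
        attrs["llm.prompt_tokens"] = len(prompt.split())
        attrs["llm.completion_tokens"] = len(completion.split())
        return completion

call_llm("summarise this document")
```

In real code the context manager is replaced by the OTel tracer, but the shape is the same: one span per operation, timing captured automatically, domain attributes attached as the work happens.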

Observability for AI Systems

AI systems introduce observability challenges that traditional web applications do not have. Model inference is non-deterministic — the same input can produce different outputs, making issues harder to reproduce. Model quality degrades over time as data distributions shift. And cost scales directly with usage, which makes spend hard to forecast.

| Signal | What to track | Alert threshold |
| --- | --- | --- |
| Inference latency (p50/p95/p99) | Response time from model | p99 > 2x baseline |
| Token usage per request | Input + output tokens | Mean > 150% of baseline |
| Error rate by error type | Rate limits, timeouts, format errors | Any error type > 1% |
| Output quality score | LLM-as-judge or heuristic quality | Rolling avg drops > 10% |
| Cost per request | Token cost + infrastructure cost | Daily cost > 120% of budget |
| Cache hit rate | Semantic cache effectiveness | Hit rate drops below 30% |
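The thresholds in the table above can be turned into a simple periodic check. The baseline and current values below are made-up illustrative numbers, not recommendations:

```python
# Evaluate the alert thresholds from the table against current measurements.
def check_signals(current, baseline):
    alerts = []
    if current["p99_latency_ms"] > 2 * baseline["p99_latency_ms"]:
        alerts.append("inference latency")
    if current["mean_tokens"] > 1.5 * baseline["mean_tokens"]:
        alerts.append("token usage")
    if any(rate > 0.01 for rate in current["error_rates"].values()):
        alerts.append("error rate")
    if current["quality_score"] < 0.9 * baseline["quality_score"]:
        alerts.append("output quality")
    if current["daily_cost"] > 1.2 * baseline["daily_cost_budget"]:
        alerts.append("cost")
    if current["cache_hit_rate"] < 0.30:
        alerts.append("cache hit rate")
    return alerts

baseline = {"p99_latency_ms": 1800, "mean_tokens": 900,
            "quality_score": 0.82, "daily_cost_budget": 400}
current = {"p99_latency_ms": 4100, "mean_tokens": 1100,
           "error_rates": {"rate_limit": 0.002, "timeout": 0.004},
           "quality_score": 0.81, "daily_cost": 380, "cache_hit_rate": 0.41}
print(check_signals(current, baseline))   # only the latency threshold is breached
```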

Implementing Observability-Driven Development

01
Instrument before you build features

Add OTel instrumentation to your service framework before writing business logic. Tracing should be automatic for all HTTP and database operations from day one.

02
Define SLOs before you define alerts

Service Level Objectives (99.9% availability, p95 latency < 200ms) give you a framework for deciding what matters. Alert on SLO budget burn rate, not individual metric thresholds.

03
Build debug workflows, not dashboards

A dashboard shows you that something is wrong. A debug workflow takes you from "something is wrong" to "here is why" in minutes. Design your observability around the debugging journey.

04
Correlate AI-specific signals with system signals

When model quality drops, is it because inference latency increased (timeout causing truncated responses)? Because a new model version was deployed? Because the RAG retrieval is returning irrelevant documents? Correlation is the answer.
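The error-budget arithmetic behind step 02 is worth making concrete. A minimal worked version, assuming a 30-day SLO window:

```python
# Error-budget arithmetic for availability SLOs over a 30-day window.
def error_budget_minutes(slo, period_days=30):
    """Allowed 'bad' minutes per period for an availability SLO."""
    return (1 - slo) * period_days * 24 * 60

def burn_rate(observed_error_rate, slo):
    """How fast the budget is being consumed relative to the SLO's allowance.
    1.0 means on pace to spend exactly the whole budget over the period."""
    return observed_error_rate / (1 - slo)

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes of downtime per 30 days
print(round(burn_rate(0.0144, 0.999), 1))      # 14.4x burn rate
```

This is why burn-rate alerting beats threshold alerting: a 1.44% error rate against a 99.9% SLO is a 14.4x burn, an unambiguous signal, regardless of where a static threshold was set.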

The best engineering teams treat observability as a feature, not infrastructure. Every sprint includes observability work because every feature that ships without observability is a feature that cannot be debugged in production.
···

Three Pillars Deep Dive: Logs, Metrics, Traces

Logs, metrics, and traces are not interchangeable — they answer different questions and have different operational costs. A common mistake is over-logging while under-instrumenting with metrics and traces. The result is terabytes of log data that takes 30 seconds to query and does not answer the question "which service is responsible for this latency spike?"

Logs answer "what happened?" They are event records: a request arrived, a database query ran, an error was thrown. Logs are the highest resolution signal but also the most expensive to store and query. They are best suited for: debugging specific errors, auditing access patterns, and reconstructing the exact sequence of events during an incident.

Metrics answer "how is the system behaving over time?" They are aggregated numerical measurements sampled at intervals: request rate, error rate, latency percentiles, resource utilisation. Metrics are cheap to store (a fixed number of time series regardless of traffic volume) and fast to query. They are best suited for: alerting, capacity planning, and SLO tracking.

Traces answer "where did this request spend its time?" A trace spans a request from entry point through every service, database call, and external API it touches. Traces are essential for distributed systems where a request crosses 5-10 service boundaries — logs from each service are disconnected without a trace to link them. They are best suited for: performance profiling, identifying bottlenecks in multi-service flows, and understanding call patterns.

| Signal type | Storage cost | Query speed | Best for | Retention (typical) |
| --- | --- | --- | --- | --- |
| Logs | High ($0.50–3/GB/month) | Slow (full-text index) | Debugging specific events | 7–30 days hot, 90 days cold |
| Metrics | Very low ($0.01–0.10/metric series/month) | Very fast (pre-aggregated) | Alerting, SLO tracking, trending | 1–2 years |
| Traces | Medium ($0.05–0.50/million spans) | Medium (indexed by trace ID) | Latency profiling, distributed debugging | 3–7 days (sampled) |

The practical implication: most production systems should store far fewer logs than they currently do (sample at 10-20% for routine requests) and far more metrics (metric cardinality is cheap). The question "is the payment service healthy right now?" should be answerable from metrics in under 100ms, not from a log query that takes 30 seconds. For teams managing production incidents, the ability to answer that question fast is the difference between a 2-minute MTTD and a 20-minute MTTD.
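The 10-20% sampling policy for routine logs, with problem signals always kept, can be sketched as follows. The rate and level names are illustrative:

```python
import random

# Keep every error/warning log; sample routine logs at a fixed rate.
def keep_log(level, sample_rate=0.15, rng=random):
    if level in ("ERROR", "WARN"):
        return True                      # never drop problem signals
    return rng.random() < sample_rate    # keep ~15% of routine logs

rng = random.Random(42)                  # seeded so the sketch is reproducible
logs = [("INFO", "request ok")] * 1000 + [("ERROR", "db timeout")] * 5
kept = [entry for entry in logs if keep_log(entry[0], rng=rng)]
```

All 5 errors survive while roughly 85% of routine volume is dropped, which is where most of the log-storage savings come from.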

···

OpenTelemetry Instrumentation Patterns

OpenTelemetry (OTel) has become the industry standard for instrumentation — a vendor-neutral SDK that emits traces, metrics, and logs in a standard format. The project was formed by merging OpenTracing and OpenCensus and moved to CNCF incubation in 2021. The tracing and metrics APIs and SDKs are stable across the major language implementations; logs support is newer and still maturing in several SDKs.

Auto-instrumentation handles the common cases: HTTP servers and clients, database drivers, message queue consumers and producers, gRPC calls. For Node.js, @opentelemetry/auto-instrumentations-node instruments Express, Fastify, HTTP, pg, Redis, and most common libraries when loaded via a single --require flag at process startup. For Python, the opentelemetry-instrument launcher (from the opentelemetry-distro package) does the same for Flask, FastAPI, SQLAlchemy, and Redis.
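Under those setups, a zero-code-change startup might look like the following. Package names match the OTel documentation; the service name and collector endpoint are placeholders:

```shell
# Node.js: auto-instrument at startup, no code changes
npm install @opentelemetry/api @opentelemetry/auto-instrumentations-node
OTEL_SERVICE_NAME=checkout-api \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
node --require @opentelemetry/auto-instrumentations-node/register app.js

# Python: the opentelemetry-instrument launcher wraps your process
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install   # detects installed libraries, adds matching instrumentations
OTEL_SERVICE_NAME=checkout-api opentelemetry-instrument python app.py
```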

01
Start with auto-instrumentation — zero code changes

Add the OTel SDK and auto-instrumentation package. Configure the exporter (OTLP to your collector). Deploy. You immediately get traces for all instrumented libraries — typically 80% of the latency in your service comes from these auto-instrumented operations.

02
Add span attributes to auto-generated spans

Auto-generated spans capture timing but not business context. Enrich them: on HTTP request spans, add customer ID and tenant ID. On database query spans, add query hash and table name. On message queue spans, add message type and queue depth. These attributes enable cross-cutting queries: "show all slow database queries for tenant X in the last hour."

03
Add manual spans for business-critical operations

Create manual spans around business logic that is not an external call — payment processing logic, fraud scoring, inventory reservation. These spans let you answer "how long does our fraud check take?" separately from "how long does the overall checkout take?"

04
Add custom metrics for business KPIs

Beyond infrastructure metrics, track business metrics as OTel counters and histograms: orders_processed_total, payment_amount_histogram, cart_abandonment_counter. These flow through the same pipeline as your infrastructure metrics and appear in the same dashboards.

  • service.name, service.version — set at SDK initialisation, enable service-level grouping
  • deployment.environment — production/staging/canary, enables environment filtering
  • user.id or tenant.id — enables per-customer latency and error analysis
  • db.statement hash — latency by query type without logging full query text (PII concern)
  • http.route — normalised route (/users/:id not /users/12345), enables per-endpoint metrics
  • error.type and error.message — structured error classification for error rate dashboards
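One attribute above, http.route, deserves a note: cardinality explodes if raw paths are used as metric labels. Frameworks usually expose the matched route directly, so the regex sketch below is only a fallback; the patterns are illustrative:

```python
import re

# Collapse path parameters so /users/12345 and /users/67890
# aggregate under one metric series instead of thousands.
def normalise_route(path):
    path = re.sub(
        r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
        "/:uuid", path)
    path = re.sub(r"/\d+", "/:id", path)
    return path

print(normalise_route("/users/12345/orders/987"))   # /users/:id/orders/:id
```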
···

Grafana + Prometheus + Tempo vs Datadog: Stack Comparison

The open-source observability stack (Grafana + Prometheus + Tempo + Loki) and Datadog represent the two dominant architectures. The choice is primarily a build-vs-buy decision with different cost profiles at different scales. For teams evaluating this in the context of CI/CD pipeline observability, note that both stacks integrate with GitHub Actions and can surface pipeline metrics alongside application metrics.

| Dimension | Grafana + Prometheus + Tempo + Loki | Datadog |
| --- | --- | --- |
| Setup time | 2–8 hours for basic stack, days for production-grade | 30 minutes to first data |
| Infrastructure cost (10 services, 50k req/min) | $50–200/month on managed cloud (Grafana Cloud) | $1,500–4,000/month (per-host + APM + logs) |
| Maintenance burden | Medium — Helm charts, retention tuning, scaling | None — fully managed |
| Trace retention (default) | Configurable (typically 3–7 days) | 15 days (APM Enterprise: 30 days) |
| Alerting | Grafana Alerting — powerful but complex to configure | Datadog Monitors — intuitive, anomaly detection built-in |
| AI/LLM observability | Manual — requires custom panels | LLM Observability product (beta 2024) |
| Mobile APM | Manual SDK integration | Native iOS/Android RUM |
| Pricing predictability | Predictable (storage-based) | Unpredictable — cardinality spikes cause bill spikes |

The cost difference widens at scale. At 500+ services, Datadog bills for full APM + logs + infrastructure often reach $30,000-100,000/month, while the Grafana stack's storage-driven costs grow sub-linearly. Most teams that migrate away from Datadog do so at the $5,000-10,000/month threshold, where the savings justify the maintenance investment.

···

SLO-Driven Alerting: Burn Rate vs Threshold Alerts

Traditional alerting fires when a metric crosses a threshold: "error rate > 1%, page the on-call." This approach has two failure modes: it fires too early (a brief 1.1% error rate spike that self-resolves in 30 seconds pages someone at 3 AM) and too late (a sustained 0.8% error rate that never crosses the threshold but will exhaust your SLO budget in 3 days).

SLO-based alerting with burn rate alerts solves both problems. Instead of alerting on the instantaneous metric, you alert on the rate at which you are consuming your SLO error budget. A burn rate of 1.0 means you are consuming your error budget at exactly the pace your SLO defines: if sustained, you will exhaust it at the end of the period. A burn rate of 14.4 means you will exhaust a 30-day error budget in about two days (30 days ÷ 14.4 ≈ 50 hours).

| Alert type | Burn rate threshold | Window | Page? | What it means |
| --- | --- | --- | --- | --- |
| Fast burn (P0) | 14.4x | 1 hour | Page immediately | Error budget exhausted in ~2 days if not resolved |
| Fast burn (P1) | 6x | 6 hours | Page immediately | Error budget exhausted in ~5 days if not resolved |
| Slow burn warning | 1x | 3-day window | Ticket (no page) | On track to exhaust budget before period ends |
| Informational | < 1x | Ongoing | No alert | Within SLO budget — no action needed |

The multi-window approach (fast burn uses a short window, slow burn uses a long window) prevents alert storms during brief spikes while catching sustained degradations early. This is the approach documented in the Google SRE Workbook and implemented natively in Grafana's SLO feature and Datadog's SLO monitors.
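A simplified decision function for this policy is sketched below. It uses one window per tier, whereas production implementations pair a long and a short window per alert to suppress flapping; the error rates would come from your metrics backend and the numbers here are illustrative:

```python
# Tiered burn-rate alerting against a 99.9% SLO (budget = 0.1% errors).
def alert_decision(err_1h, err_6h, err_3d, slo=0.999):
    budget = 1 - slo
    if err_1h / budget >= 14.4:
        return "page: fast burn (P0)"
    if err_6h / budget >= 6:
        return "page: fast burn (P1)"
    if err_3d / budget >= 1:
        return "ticket: slow burn"
    return "ok"

print(alert_decision(err_1h=0.02, err_6h=0.004, err_3d=0.0008))
# a 2% error rate over the last hour is a 20x burn: page immediately
print(alert_decision(err_1h=0.0005, err_6h=0.0004, err_3d=0.0015))
# no spike, but a sustained 1.5x burn over 3 days: file a ticket
```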

···

Observability for AI Systems

Standard observability covers infrastructure and application behaviour. AI systems — LLM inference services, RAG pipelines, agent orchestration systems — require additional observability dimensions that standard OTel instrumentation does not cover by default. Model evaluation beyond benchmarks covers the evaluation pipeline that feeds into AI observability dashboards.

Token usage tracking: each LLM API call consumes prompt tokens and completion tokens at different cost rates. Track token counts as OTel histograms per (model, endpoint, customer) tuple. This enables cost allocation by customer, identification of runaway prompts, and capacity planning. A request that consistently uses 3,000 prompt tokens when the median is 500 is a signal worth investigating.
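The token-to-cost conversion per request is simple arithmetic once per-model rates are known. The model names and per-million-token rates below are hypothetical placeholders, not real pricing:

```python
# Hypothetical (prompt $/1M tokens, completion $/1M tokens) rate table.
RATES = {
    "small-model": (0.50, 1.50),
    "large-model": (5.00, 15.00),
}

def request_cost(model, prompt_tokens, completion_tokens):
    p_rate, c_rate = RATES[model]
    return (prompt_tokens * p_rate + completion_tokens * c_rate) / 1_000_000

cost = request_cost("large-model", prompt_tokens=3000, completion_tokens=500)
print(f"${cost:.4f}")   # $0.0225 per request at these assumed rates
```

Recording this value as a histogram per (model, endpoint, customer) tuple is what enables the cost-allocation and runaway-prompt queries described above.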

Latency distribution by model and request type: LLM inference latency is highly variable — a P50 of 400ms and a P99 of 8,000ms on the same endpoint is common. Track latency as histograms with fine-grained buckets at the high end (1s, 2s, 5s, 10s, 30s) because the tail latency is where user experience degrades.

Quality metrics as time series: if you have an automated quality evaluation running (factual accuracy, groundedness, citation fidelity), emit those scores as OTel gauges. A quality dashboard that shows accuracy trending down over a 7-day window is as important as a dashboard showing error rates trending up.

  • Token usage: prompt tokens, completion tokens, total cost per request — tracked as histograms by model
  • Latency distribution: P50, P95, P99 for LLM API calls — separate from overall request latency
  • Cache hit rate: if using semantic caching (GPTCache, LangChain cache), track hit rate and latency savings
  • Retrieval quality: for RAG systems, track retrieval count, average relevance score, and empty-retrieval rate
  • Quality metric trends: factual accuracy, citation fidelity, refusal rate — as time series gauges
  • Error classification: rate limit errors, context length errors, content policy rejections — separate counters
  • Fallback rate: how often does the system fall back to a smaller model or a cached response?
···

Cost of Observability: Storage and Sampling

Observability infrastructure has a cost that scales with traffic, and that cost is often opaque until the first large bill arrives. At 10 million requests per day, naive 100% trace sampling produces roughly 50-200 million spans per day. At $0.10-0.50 per million spans for storage and indexing, this is $5-100 per day — $1,800-36,500 per year — for trace storage alone.

Sampling strategies to control cost: head-based sampling (decide at the start of a trace whether to record it — typically 1-10% uniform sampling plus 100% sampling for error traces) is simple but discards potentially valuable low-rate error traces. Tail-based sampling (make the decision at the end of a trace after seeing the outcome — 100% of errors, 100% of traces above a latency threshold, 1% of successful fast traces) provides much better coverage of interesting events at the same storage cost. Tail-based sampling requires a collector that buffers spans — the OTel Collector supports this via the tail-sampling processor.
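A tail-sampling decision matching that policy (100% of errors, 100% of slow traces, 1% baseline) can be sketched as follows; the thresholds and trace data are illustrative:

```python
import random

# Decide after the trace completes, once outcome and duration are known.
def keep_trace(has_error, duration_ms, latency_threshold_ms=2000,
               baseline_rate=0.01, rng=random):
    if has_error:
        return True                       # 100% of error traces
    if duration_ms > latency_threshold_ms:
        return True                       # 100% of slow traces
    return rng.random() < baseline_rate   # 1% of fast, successful traces

rng = random.Random(7)                    # seeded so the sketch is reproducible
traces = ([(False, 120)] * 10_000        # fast successes
          + [(True, 95)] * 20            # errors
          + [(False, 5400)] * 30)        # slow successes
kept = [t for t in traces if keep_trace(*t, rng=rng)]
```

Every interesting trace is retained while ~99% of routine volume is discarded, which is why tail sampling gives far better coverage than uniform head sampling at the same storage cost.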

···

Getting Started: The Minimum Viable Observability Stack

Teams overwhelmed by the observability landscape should start with a minimal viable stack and expand as needed. For most web applications: structured JSON logs to stdout (captured by your container platform), Prometheus-compatible metrics with four golden signals (latency, traffic, errors, saturation), and a single distributed trace per request that flows through all services. This covers 90% of debugging scenarios and can be implemented in a day.

The implementation order matters: start with metrics (cheapest to implement, fastest to alert on), then logs (essential for debugging once metrics identify the problem), then traces (necessary only when you have multiple services and need to understand cross-service behaviour). Teams that start with traces often over-invest in infrastructure before they have the basic metrics that tell them where to look. A well-configured Prometheus with four dashboards (latency, error rate, traffic, saturation) provides more debugging value than a sophisticated tracing system that nobody has learned to query.
