Night Mode Labs Blue Book
Platform Practices

Observability and SLOs

Observability is not a dashboard project. It is the ability to understand service behavior during change, failure, and growth. Use service-level objectives to connect telemetry to user impact.

Telemetry baseline

Every production service should emit:

  • Structured logs with request, trace, tenant, and deployment identifiers.
  • Metrics for latency, traffic, errors, saturation, and business-critical events.
  • Distributed traces for request paths across service boundaries.
  • Deployment markers so incidents can be correlated with releases.
  • Synthetic checks for critical user journeys.
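As an illustration of the first item, the following is a minimal sketch of structured logging using only the Python standard library. The field names (request_id, trace_id, tenant_id, deployment_id) and the build_logger helper are assumptions made for this example, not a schema prescribed by this guide.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    # Hypothetical field names; align them with your own logging schema.
    CONTEXT_FIELDS = ("request_id", "trace_id", "tenant_id", "deployment_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            # Values passed through `extra=` become attributes on the record.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info(
        "order accepted",
        extra={
            "request_id": "req-123",      # illustrative values only
            "trace_id": "trace-abc",
            "tenant_id": "tenant-42",
            "deployment_id": "deploy-7",
        },
    )
```

Emitting one JSON object per line keeps the identifiers queryable, so a log entry can be joined to a trace, a tenant, or a release without parsing free text.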

OpenTelemetry is the default instrumentation choice for new work because it reduces vendor lock-in and provides a common, vendor-neutral set of APIs and conventions for logs, metrics, and traces.

SLO design

Start with a small SLO set:

  • Availability for critical entry points.
  • Latency for user-facing or dependency-sensitive paths.
  • Correctness for workflows where success matters more than response time.
  • Freshness for data pipelines, event streams, and reporting systems.

Define the measurement window, error budget, alert threshold, owner, and runbook for each SLO. Avoid alerting on every metric. Alert on symptoms that require human action.
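One way to keep these attributes together is to record each SLO as data and derive the error budget from it. The sketch below is illustrative: the SLO class, its field names, the example runbook URL, and the burn-rate threshold are all assumptions for this example rather than a required format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical record of the attributes each SLO should define."""

    name: str
    target: float            # e.g. 0.999 means 99.9% of events must be good
    window_days: int         # measurement window
    owner: str
    runbook_url: str
    burn_rate_alert: float   # alert when the budget burns this many times too fast

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    @property
    def error_budget(self) -> float:
        """Fraction of events allowed to fail within the window."""
        return 1.0 - self.target


def budget_remaining(slo: SLO, good: int, total: int) -> float:
    """Fraction of the error budget left (negative once overspent)."""
    if total == 0:
        return 1.0
    allowed_bad = slo.error_budget * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def burn_rate(slo: SLO, good: int, total: int) -> float:
    """Observed error rate divided by the rate the budget allows.

    A value of 1.0 spends the budget exactly over the full window.
    """
    if total == 0:
        return 0.0
    return ((total - good) / total) / slo.error_budget


def should_alert(slo: SLO, good: int, total: int) -> bool:
    """Alert on the symptom (budget burning too fast), not on raw metrics."""
    return burn_rate(slo, good, total) >= slo.burn_rate_alert
```

For example, with a 99% target over 1,000 requests, 5 failures consume about half the budget, while 20 failures correspond to a burn rate of roughly 2. Alerting on burn rate ties the page to user impact instead of to an individual metric crossing a line.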

Incident readiness

A reliable platform includes:

  • Service catalog ownership and escalation paths.
  • Runbooks for common failures and rollback procedures.
  • On-call schedules with clear severity definitions.
  • Post-incident reviews focused on systems, not blame.
  • Error-budget policy that can slow feature work when reliability suffers.
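The error-budget policy in the last item can be made explicit as a small decision rule. This is a sketch under stated assumptions: the three policy levels and the 25% caution threshold are hypothetical and should be replaced by whatever the owning teams agree on.

```python
from enum import Enum


class ReleasePolicy(Enum):
    NORMAL = "normal"     # ship features as usual
    CAUTION = "caution"   # prioritize reliability work, tighten review of risky changes
    FREEZE = "freeze"     # pause feature releases until the budget recovers


def release_policy(budget_remaining: float, caution_below: float = 0.25) -> ReleasePolicy:
    """Map the remaining error budget (1.0 = untouched, <= 0 = exhausted) to a policy.

    `caution_below` is an assumed threshold, not a recommended value.
    """
    if budget_remaining <= 0.0:
        return ReleasePolicy.FREEZE
    if budget_remaining < caution_below:
        return ReleasePolicy.CAUTION
    return ReleasePolicy.NORMAL
```

Writing the rule down before an incident means the decision to slow feature work is made by agreed policy, not negotiated under pressure.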

Tooling examples
