
Logging, Metrics, and Distributed Tracing

You can't operate a system you can't see. Observability is the discipline of producing the right signals so a human (or a query) can answer "what happened" without re-deriving it from code.


1. The Three Pillars

| Signal | What | Question it answers | Granularity | Cost |
|---|---|---|---|---|
| Logs | Discrete events | "What happened?" | One per event | High |
| Metrics | Numeric aggregates | "How much / how often / how fast?" | Aggregated | Low |
| Traces | Causal chain of work across services | "Where did time go?" | Per request | Medium |

The right answer is "all three" — they're complementary, not interchangeable. Modern observability platforms (OTel, Datadog, Honeycomb, Grafana stack) support all three with shared correlation IDs.

2. Logging

What logs are good for:

  • High-detail debugging info from real systems.
  • Audit trails (who did what, when).
  • Diagnosing one specific incident with grep/regex.

What logs are bad for:

  • Aggregating "how many login failures in the last hour" — that's metrics.
  • Following a request across services — that's tracing.
  • Anything where the volume is so high you can't afford to write it all.

spdlog is the de facto C++ default; it's fast and feature-complete. Other options: glog, Boost.Log, and Quill (built around low-latency asynchronous logging).

#include <spdlog/spdlog.h>
#include <string>

int main() {
    int user_id = 42;
    std::string ip = "10.0.0.1";
    int attempt = 2, max = 5;
    std::string url = "https://api.example.com/login";
    std::string err = "connection refused";

    spdlog::info("user {} signed in from {}", user_id, ip);
    spdlog::warn("retry {} of {} for url={}", attempt, max, url);
    spdlog::error("db query failed: {}", err);
}

Levels — be disciplined:

  • TRACE — truly fine-grained dev-time diagnostics. Off in prod.
  • DEBUG — useful in dev / staging. Off in prod (or sampled).
  • INFO — significant events: startup, config, completed phases.
  • WARN — something abnormal but recoverable.
  • ERROR — something failed; the user or operator needs to know.
  • FATAL — about to die.

Log everything at the right level, then configure what's emitted per environment.
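
A minimal sketch of that per-environment configuration with spdlog: compile the calls in everywhere, set a production default, and let the SPDLOG_LEVEL environment variable override it at startup (spdlog's cfg helper reads that variable).

#include <spdlog/spdlog.h>
#include <spdlog/cfg/env.h>   // spdlog::cfg::load_env_levels

int main() {
    // Production default: INFO and above.
    spdlog::set_level(spdlog::level::info);

    // Allow overrides without a rebuild, e.g.  SPDLOG_LEVEL=debug ./server
    spdlog::cfg::load_env_levels();

    spdlog::debug("emitted only when the effective level is DEBUG or lower");
    spdlog::info("emitted under the default production level");
}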

3. Structured Logging

Plain text logs are hard to query. Structured logs (JSON or key-value) are queryable:

2024-01-15T12:34:56Z INFO request_id=abc123 user_id=42 latency_ms=87 endpoint=/api/v1/cart status=200

vs

{"ts":"2024-01-15T12:34:56Z","lvl":"INFO","request_id":"abc123","user_id":42,"latency_ms":87,"endpoint":"/api/v1/cart","status":200}

Now you can grep, filter, aggregate. Tools like Loki, Elasticsearch, Vector ingest these natively.

spdlog can emit JSON via a custom pattern or formatter; OpenTelemetry's logger emits structured records natively. Standardize field names across your services (request_id, user_id, trace_id, etc.) so cross-service queries work.
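
A sketch of the pattern-string approach: wrap every record in a JSON envelope and keep the payload itself as key=value pairs. The field names here (ts, lvl, msg) are illustrative, not a standard; for strict JSON (with escaping) you'd use a custom formatter or a JSON library.

#include <spdlog/spdlog.h>

int main() {
    // Shape each record as a JSON object; %v is the message payload.
    spdlog::set_pattern(R"({"ts":"%Y-%m-%dT%H:%M:%S.%e%z","lvl":"%l","msg":"%v"})");

    // Keep the payload queryable: stable key=value fields, no free-form prose.
    spdlog::info("request_id=abc123 user_id=42 latency_ms=87 endpoint=/api/v1/cart status=200");
}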

4. Metrics

Numeric time-series. Designed for aggregation, dashboards, alerts:

#include <prometheus/counter.h>
#include <prometheus/histogram.h>
#include <prometheus/registry.h>

int main() {
    prometheus::Registry registry;

    auto& login_family = prometheus::BuildCounter()
        .Name("login_total")
        .Help("Total login attempts")
        .Register(registry);
    auto& login_total = login_family.Add({});
    login_total.Increment();

    auto& latency_family = prometheus::BuildHistogram()
        .Name("login_latency_seconds")
        .Help("Login latency")
        .Register(registry);
    auto& login_latency = latency_family.Add(
        {}, prometheus::Histogram::BucketBoundaries{0.001, 0.01, 0.1, 1.0, 10.0});
    login_latency.Observe(0.087);
}

Metric types (Prometheus / OTel):

  • Counter. Monotonically increasing. requests_total, errors_total. Reset only on restart. Compute rate (rate(x[5m])) for "per second."
  • Gauge. Goes up and down. queue_depth, temperature_celsius, memory_bytes (sketch after this list).
  • Histogram. Buckets observations into ranges. Best for latencies; supports percentiles via histogram_quantile.
  • Summary. Like histogram but pre-computes quantiles client-side. Avoid unless you specifically need this.
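
To round out the counter and histogram shown above, a minimal gauge sketch with prometheus-cpp; the metric name queue_depth is illustrative.

#include <prometheus/gauge.h>
#include <prometheus/registry.h>

int main() {
    prometheus::Registry registry;

    auto& depth_family = prometheus::BuildGauge()
        .Name("queue_depth")
        .Help("Jobs currently waiting in the queue")
        .Register(registry);
    auto& queue_depth = depth_family.Add({});

    queue_depth.Increment();   // job enqueued
    queue_depth.Decrement();   // job dequeued
    queue_depth.Set(17);       // or set an absolute value from a snapshot
}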

Cardinality discipline (see §7).

5. Distributed Tracing

A trace is a tree (or DAG) of spans. Each span is one unit of work — a request handler, a DB query, an RPC call. Spans carry timing, attributes, and a link to the parent span.

[span: HTTP /checkout                                    ] 250ms
  ├─[span: auth.verify        ] 8ms
  ├─[span: cart.fetch (DB)    ] 14ms
  ├─[span: payments.charge (RPC) ] 220ms
  │   ├─[span: bank.api (RPC)] 200ms
  │   └─[span: ledger.write  ] 12ms
  └─[span: email.send (queue)] 5ms

You see exactly where the 250ms went. Traces correlate across services because the parent passes the trace ID forward in headers (W3C Trace Context: traceparent).
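
For concreteness, a traceparent header is just version-traceid-parentid-flags in lowercase hex. A sketch of building one by hand (in real code the OTel propagator does this for you; the IDs below are the W3C spec's example values):

#include <cstdio>
#include <string>

// W3C Trace Context: 2-hex version, 32-hex trace-id, 16-hex parent span-id, 2-hex flags.
std::string make_traceparent(const std::string& trace_id_hex,
                             const std::string& span_id_hex,
                             bool sampled) {
    return "00-" + trace_id_hex + "-" + span_id_hex + (sampled ? "-01" : "-00");
}

int main() {
    // Sent as an HTTP header on the outgoing call:
    std::printf("traceparent: %s\n",
                make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736",
                                 "00f067aa0ba902b7", true).c_str());
}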

When tracing pays off:

  • Microservices — request flows through 5+ services.
  • Async work — a request creates async tasks; tracing links them.
  • Tail-latency debugging — sample slow traces to see where they spend time.

When it's overkill: monoliths with low call depth, batch jobs, embedded.

6. OpenTelemetry C++

OpenTelemetry is the open standard for observability instrumentation. Vendor-neutral — point the exporter at your platform of choice.

#include <memory>

#include <opentelemetry/trace/provider.h>
#include <opentelemetry/exporters/otlp/otlp_grpc_exporter_factory.h>
#include <opentelemetry/sdk/trace/simple_processor_factory.h>
#include <opentelemetry/sdk/trace/tracer_provider_factory.h>

namespace trace_api = opentelemetry::trace;
namespace trace_sdk = opentelemetry::sdk::trace;
namespace otlp      = opentelemetry::exporter::otlp;

int main() {
    // Wire exporter -> processor -> provider, then install the provider globally.
    auto exporter  = otlp::OtlpGrpcExporterFactory::Create();
    auto processor = trace_sdk::SimpleSpanProcessorFactory::Create(std::move(exporter));
    std::shared_ptr<trace_api::TracerProvider> provider =
        trace_sdk::TracerProviderFactory::Create(std::move(processor));
    trace_api::Provider::SetTracerProvider(provider);

    auto tracer = trace_api::Provider::GetTracerProvider()->GetTracer("checkout");
    auto span   = tracer->StartSpan("http_handler");
    auto scope  = tracer->WithActiveSpan(span);  // make this span the active one
    // ... work ...
    span->SetAttribute("status_code", 200);
    span->End();
}
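
The simple processor above exports each span synchronously as it ends, which is fine for demos; in a production service you would normally swap in the batch processor (opentelemetry-cpp ships a BatchSpanProcessorFactory) so spans are buffered and exported off the request path.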

OTel components:

  • API: what application code calls (Tracer, Meter, Logger).
  • SDK: implements the API; configures samplers, processors, exporters.
  • OTLP: the wire protocol (over gRPC or HTTP).
  • Collector: a separate process that receives OTLP, transforms, fans out to backends.

The Collector pattern means your app doesn't need vendor-specific code; the Collector translates to Jaeger/Zipkin/Tempo/Datadog/etc.

7. Cost and Cardinality

Observability is expensive. The two cost drivers:

Volume. Logging at TRACE for every request quickly runs to terabytes per day at scale. Sample.

Cardinality. Each unique combination of metric name + labels = one time series. Storage and memory cost is proportional to active series count.

#include <prometheus/counter.h>
#include <prometheus/registry.h>
#include <string>

int main() {
    prometheus::Registry registry;
    auto& family = prometheus::BuildCounter()
        .Name("requests_total")
        .Help("Total requests")
        .Register(registry);

    std::string uid       = "user-9173455";
    std::string plan_tier = "pro";

    // BAD: user_id has millions of values -- one new time series per user
    family.Add({{"user_id", uid}}).Increment();

    // OK: bounded labels -- 3 plan tiers = 3 series
    family.Add({{"plan", plan_tier}}).Increment();
}

Heuristics:

  • Don't put unbounded values in metric labels. User IDs, request IDs, exact paths. Use logs/traces for those.
  • Cap dashboards' label cardinality. Anything over ~10k series per metric is an alarm bell.
  • Sample traces. 1–10% in production is usually enough; sample 100% of errors and slow requests (sketch after this list).
  • Tier logging. ERROR/WARN to long-term storage; INFO/DEBUG to short retention.
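
A sketch of head sampling with opentelemetry-cpp, assuming a recent release with the TracerProviderFactory overload that accepts a sampler. The 5% ratio is illustrative; keeping 100% of errors and slow requests is a tail-sampling decision, typically made in the Collector, and not shown here.

#include <memory>

#include <opentelemetry/exporters/otlp/otlp_grpc_exporter_factory.h>
#include <opentelemetry/sdk/resource/resource.h>
#include <opentelemetry/sdk/trace/batch_span_processor_factory.h>
#include <opentelemetry/sdk/trace/batch_span_processor_options.h>
#include <opentelemetry/sdk/trace/samplers/parent.h>
#include <opentelemetry/sdk/trace/samplers/trace_id_ratio.h>
#include <opentelemetry/sdk/trace/tracer_provider_factory.h>
#include <opentelemetry/trace/provider.h>

namespace trace_api = opentelemetry::trace;
namespace trace_sdk = opentelemetry::sdk::trace;
namespace otlp      = opentelemetry::exporter::otlp;

int main() {
    auto exporter  = otlp::OtlpGrpcExporterFactory::Create();
    auto processor = trace_sdk::BatchSpanProcessorFactory::Create(
        std::move(exporter), trace_sdk::BatchSpanProcessorOptions{});

    // Keep 5% of new traces, but always follow the caller's decision for
    // requests that arrive with an already-sampled traceparent.
    auto sampler = std::make_unique<trace_sdk::ParentBasedSampler>(
        std::make_shared<trace_sdk::TraceIdRatioBasedSampler>(0.05));

    std::shared_ptr<trace_api::TracerProvider> provider =
        trace_sdk::TracerProviderFactory::Create(
            std::move(processor),
            opentelemetry::sdk::resource::Resource::Create({}),
            std::move(sampler));
    trace_api::Provider::SetTracerProvider(provider);
}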

8. Quick Reference

| Question | Tool |
|---|---|
| What did this user see at 14:32? | Logs (filter by user_id, time) |
| What's the p99 of /checkout? | Metrics (histogram + quantile) |
| Why was that one request slow? | Trace (find the slow span) |
| Are we erroring more this hour? | Metrics (rate of errors_total) |
| What's the queue backlog? | Metrics (gauge) |
| Was config X applied at startup? | Logs (INFO startup line) |
| Did service A actually call service B? | Trace (span hierarchy) |

Cross-link them: every log line should include the trace_id when a trace is active; every span should carry the same request_id that appears in the logs; every error metric should have a label that joins back to a log query.
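
A sketch of the first of those cross-links, assuming the opentelemetry-cpp span accessors and spdlog from earlier; the helper name to_hex is made up for illustration.

#include <string>

#include <opentelemetry/nostd/shared_ptr.h>
#include <opentelemetry/trace/span.h>
#include <opentelemetry/trace/trace_id.h>
#include <spdlog/spdlog.h>

// Illustrative helper: render the 16-byte trace id as 32 lowercase hex chars.
std::string to_hex(const opentelemetry::trace::TraceId& id) {
    char buf[32];  // TraceId is 16 bytes -> 32 hex characters
    id.ToLowerBase16(buf);
    return std::string(buf, sizeof(buf));
}

void log_with_trace(const opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span>& span,
                    const std::string& request_id) {
    // Same field names as the structured-log examples, plus the active span's trace_id.
    spdlog::info("request_id={} trace_id={} msg=checkout_started",
                 request_id, to_hex(span->GetContext().trace_id()));
}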

References