7 min

SLOs & observability

The SLO catalogue, metrics, and dashboards.

By the end you’ll be able to

Read the three default SLOs shipped by `@databridge/metrics-rollup`.
Know how an SLO is evaluated against a rolled-up metric.
Find the SLOs admin surface and the metrics endpoint.

DataBridge emits a steady stream of operational metric samples — `rule_run_ms` (how long a rule run took), `assistant_tokens_total` (how many tokens the assistant spent), `adapter_rows_total` (how many rows an adapter moved), and so on. `@databridge/metrics-rollup` turns that raw stream into something a dashboard can read: it buckets timestamped samples into fixed 5-minute / 1-hour / 24-hour windows and computes count / sum / min / max / avg / last per bucket.
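The bucketing described above can be sketched in a few lines. This is a minimal illustration only, not the package's actual code: the `Sample` and `Bucket` shapes, the window labels `5m` / `1h` / `24h` as keys, and the function name `rollupSketch` are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; the real package's types may differ.
type WindowName = "5m" | "1h" | "24h";

const WINDOW_MS: Record<WindowName, number> = {
  "5m": 5 * 60 * 1000,
  "1h": 60 * 60 * 1000,
  "24h": 24 * 60 * 60 * 1000,
};

interface Sample {
  metric: string;
  value: number;
  timestampMs: number;
}

interface Bucket {
  startMs: number; // inclusive start of the fixed window
  count: number;
  sum: number;
  min: number;
  max: number;
  avg: number;
  last: number; // value of the most recent sample in the bucket
}

// Assumed helper: groups one metric's samples into fixed, aligned windows.
function rollupSketch(samples: Sample[], metric: string, window: WindowName): Bucket[] {
  const size = WINDOW_MS[window];
  const buckets = new Map<number, Bucket>();

  // filter() copies, so sorting here does not mutate the caller's array.
  const ordered = samples
    .filter((s) => s.metric === metric)
    .sort((a, b) => a.timestampMs - b.timestampMs);

  for (const s of ordered) {
    const startMs = Math.floor(s.timestampMs / size) * size;
    const b = buckets.get(startMs);
    if (b === undefined) {
      buckets.set(startMs, {
        startMs,
        count: 1,
        sum: s.value,
        min: s.value,
        max: s.value,
        avg: s.value,
        last: s.value,
      });
    } else {
      b.count += 1;
      b.sum += s.value;
      b.min = Math.min(b.min, s.value);
      b.max = Math.max(b.max, s.value);
      b.avg = b.sum / b.count;
      b.last = s.value; // samples are visited oldest first
    }
  }

  return Array.from(buckets.values()).sort((a, b) => a.startMs - b.startMs);
}
```

With two `rule_run_ms` samples in the first five minutes and one in the next, `rollupSketch(samples, "rule_run_ms", "5m")` would return two buckets, each carrying its own count, sum, min, max, avg and last.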

An SLO is defined over one metric and one window. It names an `aggregate` (`avg`, `max`, `min`, `last`, `sum`), a `comparator` (`lt`, `lte`, `gt`, `gte`), and a `threshold`. `evaluateSlo` rolls the metric up over the SLO's window, reduces the buckets with the aggregate, and compares the result against the threshold. An SLO with no samples in its window is treated as vacuously `healthy`.
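The reduce-then-compare step can be sketched as follows. This is a hedged illustration rather than the real `evaluateSlo`: the field names, the `breached` status label, and the choice to compute `avg` as total sum over total count across buckets are assumptions. Only `healthy`, the aggregates, and the comparators come from the description above.

```typescript
type Aggregate = "avg" | "max" | "min" | "last" | "sum";
type Comparator = "lt" | "lte" | "gt" | "gte";

// Hypothetical SLO shape; the real definition may carry more fields.
interface SloSketch {
  metric: string;
  aggregate: Aggregate;
  comparator: Comparator;
  threshold: number;
}

// Minimal per-bucket summary, assumed to be ordered oldest first.
interface BucketSummary {
  count: number;
  sum: number;
  min: number;
  max: number;
  last: number;
}

// Assumes a non-empty bucket list; the caller handles the empty case.
function reduceBuckets(buckets: BucketSummary[], aggregate: Aggregate): number {
  const totalCount = buckets.reduce((n, b) => n + b.count, 0);
  const totalSum = buckets.reduce((n, b) => n + b.sum, 0);
  switch (aggregate) {
    case "sum":
      return totalSum;
    case "avg":
      return totalSum / totalCount; // assumption: sample-weighted average
    case "min":
      return Math.min(...buckets.map((b) => b.min));
    case "max":
      return Math.max(...buckets.map((b) => b.max));
    case "last":
      return buckets[buckets.length - 1].last;
  }
}

function compare(value: number, comparator: Comparator, threshold: number): boolean {
  switch (comparator) {
    case "lt":
      return value < threshold;
    case "lte":
      return value <= threshold;
    case "gt":
      return value > threshold;
    case "gte":
      return value >= threshold;
  }
}

// "breached" is a made-up label for the failing case.
function evaluateSloSketch(
  slo: SloSketch,
  buckets: BucketSummary[],
): { status: "healthy" | "breached"; value: number | null } {
  if (buckets.length === 0) {
    return { status: "healthy", value: null }; // no samples: vacuously healthy
  }
  const value = reduceBuckets(buckets, slo.aggregate);
  const ok = compare(value, slo.comparator, slo.threshold);
  return { status: ok ? "healthy" : "breached", value };
}
```

Under these assumptions, an `avg` / `lt` / `2000` SLO evaluated against a single bucket with `count: 2` and `sum: 3000` reduces to 1500 and passes, while an empty bucket list passes without producing a value.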

Three default SLOs ship in `DEFAULT_SLOS` (`packages/metrics-rollup/src/index.ts`):

`rule-run-latency-p-ok`: `avg(rule_run_ms)` over `1h` `< 2000` ("average rule-run under 2s").
`assistant-spend-cap`: `sum(assistant_tokens_total)` over `24h` `< 5_000_000` ("daily assistant tokens under cap").
`adapter-throughput`: `sum(adapter_rows_total)` over `1h` `>= 1` ("at least some adapter throughput hourly").

Customers can layer their own SLOs on top.
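Written out as data, the three defaults might look like the sketch below. The values (metric, window, aggregate, comparator, threshold, description) come from the catalogue above. The field names `id`, `description`, `metric` and `window`, the constant name `DEFAULT_SLOS_SKETCH`, and the extra customer SLO are assumptions for illustration, so check `packages/metrics-rollup/src/index.ts` for the real shape.

```typescript
// Hypothetical definition shape; field names are assumed.
interface SloDefinitionSketch {
  id: string;
  description: string;
  metric: string;
  window: "5m" | "1h" | "24h";
  aggregate: "avg" | "max" | "min" | "last" | "sum";
  comparator: "lt" | "lte" | "gt" | "gte";
  threshold: number;
}

const DEFAULT_SLOS_SKETCH: SloDefinitionSketch[] = [
  {
    id: "rule-run-latency-p-ok",
    description: "average rule-run under 2s",
    metric: "rule_run_ms",
    window: "1h",
    aggregate: "avg",
    comparator: "lt",
    threshold: 2000,
  },
  {
    id: "assistant-spend-cap",
    description: "daily assistant tokens under cap",
    metric: "assistant_tokens_total",
    window: "24h",
    aggregate: "sum",
    comparator: "lt",
    threshold: 5_000_000,
  },
  {
    id: "adapter-throughput",
    description: "at least some adapter throughput hourly",
    metric: "adapter_rows_total",
    window: "1h",
    aggregate: "sum",
    comparator: "gte",
    threshold: 1,
  },
];

// A made-up customer SLO, shown only to illustrate layering on top of the defaults.
const customerSlosSketch: SloDefinitionSketch[] = [
  {
    id: "rule-run-worst-case",
    description: "no rule run over 10s in any 5-minute window",
    metric: "rule_run_ms",
    window: "5m",
    aggregate: "max",
    comparator: "lte",
    threshold: 10_000,
  },
];

// Assumption: layering is a simple concatenation; the real mechanism may differ.
const allSlosSketch: SloDefinitionSketch[] = [...DEFAULT_SLOS_SKETCH, ...customerSlosSketch];
```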

The metrics surface is documented in `docs/OPERATOR_GUIDE.md` §4.3: `GET /metrics` returns Prometheus exposition text (v0.0.4), and the key counters are `hesa_submissions_total`, `hesa_violations_total`, `hesa_signoffs_total` and `hesa_submissions_rejected_total`, plus the `hesa_last_submission_submittable` gauge. The wider observability story includes the `@databridge/observability-core` exporters (`observability-exporter-otlp-json`, `observability-exporter-prometheus`).
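To make the exposition format concrete, here is a small parser sketch for the text that `GET /metrics` returns. It is a simplified reading of the Prometheus text format (it ignores timestamps and does not handle `}` inside label values), the function names are hypothetical, and the sample payload uses made-up values purely for illustration.

```typescript
interface MetricLine {
  name: string;
  labels: string; // raw label block such as {status="ok"}, or "" when absent
  value: number;
}

function parseValue(raw: string): number {
  if (raw === "+Inf") return Infinity;
  if (raw === "-Inf") return -Infinity;
  return Number(raw); // the literal "NaN" parses to NaN
}

// Simplified parser: skips blank lines and # HELP / # TYPE comments.
function parseExpositionSketch(text: string): MetricLine[] {
  const pattern = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)/;
  const out: MetricLine[] = [];
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const m = pattern.exec(line);
    if (m === null) continue; // ignore lines this sketch does not understand
    out.push({ name: m[1], labels: m[2] ?? "", value: parseValue(m[3]) });
  }
  return out;
}

// Returns the first series with the given name, if any.
function firstValue(lines: MetricLine[], name: string): number | undefined {
  for (const l of lines) {
    if (l.name === name) return l.value;
  }
  return undefined;
}

// Illustrative payload only; the numbers are invented for the example.
const examplePayload = [
  "# HELP hesa_submissions_total Total submissions.",
  "# TYPE hesa_submissions_total counter",
  "hesa_submissions_total 12",
  "# TYPE hesa_last_submission_submittable gauge",
  "hesa_last_submission_submittable 1",
].join("\n");

const parsedExample = parseExpositionSketch(examplePayload);
```

In practice the payload would come from an HTTP call to the API's `/metrics` route; the parsing step stays the same.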

Dashboards are validated by `pnpm dashboards:check`, which is part of the freshness gate. If a dashboard references a metric name that no longer exists in code, the gate fails — that is what keeps observability honest as the surface evolves.
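The idea behind that gate can be illustrated with a short sketch. This is not how `pnpm dashboards:check` is implemented; the `DashboardSketch` shape, the function name `findStaleMetricRefs`, and the example dashboards are hypothetical. It only shows the comparison of referenced metric names against the names known in code.

```typescript
// Hypothetical dashboard shape: a title plus the metric names its panels query.
interface DashboardSketch {
  title: string;
  metrics: string[];
}

interface StaleRef {
  dashboard: string;
  metric: string;
}

// Returns every dashboard reference to a metric name that is not in the known set.
function findStaleMetricRefs(dashboards: DashboardSketch[], knownMetrics: string[]): StaleRef[] {
  const known = new Set(knownMetrics);
  const stale: StaleRef[] = [];
  for (const d of dashboards) {
    for (const metric of d.metrics) {
      if (!known.has(metric)) {
        stale.push({ dashboard: d.title, metric });
      }
    }
  }
  return stale;
}

// Example: the second dashboard points at a name that no longer exists (made up here).
const staleRefsExample = findStaleMetricRefs(
  [
    { title: "Rule runs", metrics: ["rule_run_ms"] },
    { title: "Adapters", metrics: ["adapter_rows_total", "adapter_rows_moved"] },
  ],
  ["rule_run_ms", "assistant_tokens_total", "adapter_rows_total"],
);

// A gate built on this idea would fail whenever the list is non-empty.
const gatePassesExample = staleRefsExample.length === 0;
```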

Walkthrough

1. Open the SLOs surface
The admin SLO browser lists the configured SLOs and their current evaluation. Walk through it once to see the shape.
Open SLOs
2. Tour the admin home
From the admin home you can hop across to webhooks, marketplace and waivers — every operator surface lives behind one nav.
Open admin console
3. Check the audit log
The audit log is the other half of observability: a tamper-evident record of every meaningful action.
Open audit log

Your turn

Open the admin SLOs surface and confirm you can see the configured SLOs evaluating.

Hint: Use the 'Open the SLOs surface' step above.

Knowledge check

1. Which metric does the default `rule-run-latency-p-ok` SLO evaluate?

2. How does `evaluateSlo` treat an SLO with no samples in its window?

3. What format does the API expose metrics in?
