Your Business instruments the LLM call graph the way Datadog instruments the HTTP one. Drop in three lines of SDK and you get distributed traces, live evals, prompt versioning, replay, and cost attribution — across every model, tool, and retry.
Your Business tracing is built on the OpenTelemetry semantic conventions, extended with LLM-specific span attributes (token counts, retries, eval scores). One auto-instrumentation call wraps every supported SDK; manual spans take a one-line decorator.
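To make the manual-span idea concrete, here is a minimal, standard-library-only sketch of what a one-line decorator could look like. It is not the Your Business SDK: `traced`, `RECORDED_SPANS`, and `summarize` are hypothetical names, the `gen_ai.*` keys are modeled on the naming style of OpenTelemetry's GenAI conventions, and `llm.retry_count` is invented for illustration.

```python
import functools
import time
import uuid

# In-memory stand-in for an exporter; a real setup would ship spans to a backend.
RECORDED_SPANS = []


def traced(name):
    """Hypothetical decorator that records one span per call of the wrapped function."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {"span_id": uuid.uuid4().hex, "name": name, "attributes": {}}
            start = time.perf_counter()
            try:
                # The wrapped function receives the span so it can attach attributes.
                return fn(span, *args, **kwargs)
            finally:
                span["duration_ms"] = (time.perf_counter() - start) * 1000
                RECORDED_SPANS.append(span)
        return wrapper
    return decorator


@traced("summarize")
def summarize(span, text):
    # Stand-in for a model call; token counts are crude whitespace counts.
    output = text[:20]
    span["attributes"].update({
        "gen_ai.request.model": "example-model",   # illustrative model name
        "gen_ai.usage.input_tokens": len(text.split()),
        "gen_ai.usage.output_tokens": len(output.split()),
        "llm.retry_count": 0,                       # invented attribute name
    })
    return output
```

Calling `summarize("some text")` appends one span dictionary to `RECORDED_SPANS`, with the token counts attached as attributes.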
Search by anything you tag: customer ID, feature, model version, eval score, environment. Live trace tail surfaces production traffic in under 800 milliseconds, end to end.
Run a panel of LLM-as-judge, classifier, and assertion-based evals on every span — or sample at any rate. Alerts trigger when the score for a slice drops below threshold. Evals run async in our cluster, so they don't add latency to your user request.
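The sampling and slice-threshold logic described above can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's implementation: `sampled` and `slices_below_threshold` are hypothetical helpers, spans are plain dictionaries with `slice` and `score` keys, and a slice's score is taken to be the mean of its span scores.

```python
import hashlib
from collections import defaultdict


def sampled(span_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of spans, keyed on the span ID."""
    digest = hashlib.sha256(span_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def slices_below_threshold(scored_spans, threshold):
    """Return {slice: mean score} for every slice whose mean is under threshold."""
    totals = defaultdict(lambda: [0.0, 0])  # slice -> [score sum, span count]
    for span in scored_spans:
        entry = totals[span["slice"]]
        entry[0] += span["score"]
        entry[1] += 1
    means = {name: total / count for name, (total, count) in totals.items()}
    return {name: mean for name, mean in means.items() if mean < threshold}
```

Hashing the span ID rather than calling a random number generator means the same span always gets the same sampling decision, which keeps reruns and backfills consistent.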
Out-of-the-box panels for factuality, helpfulness, brand voice, refusal accuracy, and PII leakage. Or write your own and version-control them in your repo.
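As an example of a custom eval you might keep in your own repo, here is a minimal assertion-based check for PII leakage. The function name, the result shape, and the two patterns (email addresses and US Social Security numbers) are assumptions chosen for illustration; a production check would need far broader coverage.

```python
import re

# Deliberately simple patterns; real PII detection needs many more cases.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def pii_leakage_eval(output_text: str) -> dict:
    """Hypothetical assertion-style eval: passes only if no pattern matches."""
    findings = []
    if EMAIL.search(output_text):
        findings.append("email")
    if US_SSN.search(output_text):
        findings.append("ssn")
    return {"eval": "pii_leakage", "passed": not findings, "findings": findings}
```

Because the check is a pure function of the model output, it can be unit-tested and code-reviewed like any other module.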
Per-customer, per-feature, per-model, per-retry — slice the bill any way you want. Your Business captures input and output tokens at the span level, applies your provider's price list (kept up to date for you), and rolls it up.
Spend dashboards refresh in real time. Set budget alerts per slice. Export to your finance team's Snowflake or BigQuery warehouse.
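The arithmetic behind span-level cost attribution is simple enough to sketch: multiply each span's token counts by a per-token price, then sum by whatever key you want to slice on. The price list, model names, and span fields below are hypothetical placeholders, not real provider prices or the platform's schema.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical prices in USD per 1M tokens; not real provider pricing.
PRICES = {
    "model-a": {"input": Decimal("3.00"), "output": Decimal("15.00")},
    "model-b": {"input": Decimal("0.50"), "output": Decimal("1.50")},
}
TOKENS_PER_UNIT = Decimal(1_000_000)


def span_cost(span) -> Decimal:
    """Cost of one span: input and output tokens priced separately."""
    price = PRICES[span["model"]]
    total = (span["input_tokens"] * price["input"]
             + span["output_tokens"] * price["output"])
    return total / TOKENS_PER_UNIT


def rollup(spans, key):
    """Sum span costs grouped by any span field, e.g. 'customer' or 'model'."""
    totals = defaultdict(Decimal)
    for span in spans:
        totals[span[key]] += span_cost(span)
    return dict(totals)
```

`Decimal` is used instead of floats so that small per-span amounts sum without rounding drift.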
Replay rebuilds the exact tool sandbox, model parameters, and inputs from any span you've recorded — and runs them again locally, against any model. Step through the trace, swap the model, change the prompt, and rerun.
The fastest way we've found to debug an agent regression is to replay the exact failing trace, drop a breakpoint in the tool call, and inspect the inputs the LLM actually saw — not what your test suite thinks it saw.
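Conceptually, a replay needs three things from the recording: the inputs, the model parameters, and the tool outputs the model saw. The sketch below shows that shape under loud assumptions: `RecordedSpan`, `replay`, and `fake_model` are hypothetical, and the recorded tool results are served from a dictionary instead of a rebuilt sandbox.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class RecordedSpan:
    """Hypothetical snapshot of what a model call saw in production."""
    model: str
    prompt: str
    params: dict
    tool_results: dict = field(default_factory=dict)  # tool name -> recorded output


def replay(span, call_model, **overrides):
    """Re-run a recorded span, optionally swapping fields such as model or prompt."""
    patched = replace(span, **overrides)

    def tool(name):
        # Tools are answered from the recording rather than executed again.
        return patched.tool_results[name]

    return call_model(patched.model, patched.prompt, patched.params, tool)


def fake_model(model, prompt, params, tool):
    # Stand-in for a real model client, so the sketch runs offline.
    return f"{model}|{prompt}|{tool('lookup_order')}"
```

With a span recorded as `RecordedSpan("model-a", "Where is my order?", {"temperature": 0}, {"lookup_order": "shipped"})`, calling `replay(span, fake_model, model="model-b")` reruns the same inputs against a different model while leaving the original recording untouched.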
Treat prompts the way you treat config: separate from code, versioned, diffed, and rollable. Push a new prompt, canary it to 10% of traffic, watch the eval scores, promote or roll back — all from the dashboard or a CLI command.
Every prompt change is an artifact with a hash, a diff, an author, and a deploy event. If something breaks, revert is one click.
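One plausible way to model a prompt artifact and a percentage canary is shown below. This is a sketch, not the product's storage format or routing logic: `prompt_artifact` and `pick_version` are hypothetical, the hash is a truncated SHA-256 of the prompt text, and canary assignment is a deterministic bucket derived from a request ID.

```python
import difflib
import hashlib


def prompt_artifact(text, author, previous=""):
    """Build a versionable record: content hash, author, and diff from the prior text."""
    diff = difflib.unified_diff(
        previous.splitlines(), text.splitlines(),
        fromfile="previous", tofile="current", lineterm="",
    )
    return {
        "hash": hashlib.sha256(text.encode("utf-8")).hexdigest()[:12],
        "author": author,
        "diff": list(diff),
    }


def pick_version(request_id, stable, canary, canary_percent=10):
    """Route a fixed share of requests to the canary, stable per request ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return canary if bucket < canary_percent else stable
```

Keying the bucket on the request ID means a retried request sees the same prompt version, and rolling back is just setting `canary_percent` to 0.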
Pick the layers you need. Your Business's modules ship together but compose independently — most teams start with tracing only and turn on evals after the first week.
A four-step loop most of our customers run weekly. Each step lives in one place and never leaves the platform.
An eval slice drops below threshold. Slack alert fires. The dashboard surfaces the failing traces, grouped by feature.
Click into a failing trace, hit "replay locally," and step through the tool sandbox in your IDE. Reproduce the bug in 30 seconds.
Edit the prompt, push as `v19`, promote to canary at 10%. The dashboard shows the new version's eval scores in real time.
Once `v19` beats `v18` on every panel for 200 traces, promote to stable. Convert the original failing trace into a regression test in CI.
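The promotion rule in the last step of the loop can be expressed as a small gate. The sketch assumes that "beats" means a strictly higher mean score on each panel and that scores arrive as lists keyed by panel name; `ready_to_promote` and both of those choices are illustrative, not the platform's definition.

```python
from statistics import mean


def ready_to_promote(candidate, baseline, min_traces=200):
    """True if the candidate has enough traces and a higher mean on every panel.

    Both arguments map panel name -> list of eval scores.
    """
    if not baseline:
        return False  # nothing to compare against
    for panel, base_scores in baseline.items():
        cand_scores = candidate.get(panel, [])
        if not base_scores or len(cand_scores) < min_traces:
            return False
        if mean(cand_scores) <= mean(base_scores):
            return False
    return True
```

A tie counts as a failure here, so a new version has to show a measurable improvement before it replaces the stable one.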
A short list of design decisions that distinguish us from logging tools, eval platforms, and the LLM-observability category at large.
Every span is OpenTelemetry-shaped. Pipe to Datadog, Honeycomb, or Grafana Tempo with no re-instrumentation. We're a complement to your existing stack, not a fork.
Most platforms run evals in CI only. We run them on production traffic, async, in our cluster, with no added user latency. Bugs surface in production, where they actually hurt.
Add the whole company. We charge by ingest volume, the way logs are priced. No "but only three engineers can see the dashboard" trap.
You can read a trace anywhere. You can replay one only here. Reproducing the bug is most of the debugging — we make that the cheapest step.
The team is from Datadog, Honeycomb, and Sentry. We've shipped these primitives before, just for HTTP and crashes. Now we're doing it for LLM call graphs.
We don't make you click through ten dashboards to get value. The SDK does most of the work. The UI is for the humans who didn't write the code.
Free up to 50,000 spans a month. No credit card. Drop in three lines of SDK and you'll see your first trace in the dashboard in under a minute.