// product

Every layer of your agent. Observed, by default.

Your Business instruments the LLM call graph the way Datadog instruments the HTTP one. Drop in three lines of SDK and you get distributed traces, live evals, prompt versioning, replay, and cost attribution — across every model, tool, and retry.

// 01 · distributed tracing

OTel-native traces for every span your agent emits.

Your Business tracing is built on the OpenTelemetry semantic-conventions standard, extended with LLM-specific span attributes (token counts, retries, eval scores). One auto-instrumentation call wraps every supported SDK; manual spans take a one-line decorator.

Search by anything you tag — customer ID, feature, model version, eval score, environment. Live trace tail shows production traffic in under 800 milliseconds, end-to-end.
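To make "search by anything you tag" concrete, here is a minimal sketch of what a tagged LLM span could look like and how a tag filter might select it. The `gen_ai.*` keys follow OpenTelemetry's GenAI semantic conventions; `retry.count`, `eval.score`, the `LlmSpan` type, the `searchSpans` helper, and all sample values are assumptions for illustration, not the Your Business schema or API.

```typescript
// Hypothetical span shape. Only the gen_ai.* keys come from OpenTelemetry's
// GenAI semantic conventions; every other name and value here is made up.
type AttrValue = string | number | boolean;

interface LlmSpan {
  name: string;
  durationMs: number;
  attributes: Record<string, AttrValue>;
}

// Return the spans whose attributes match every key/value pair in the filter.
function searchSpans(
  spans: LlmSpan[],
  filter: Record<string, AttrValue>
): LlmSpan[] {
  return spans.filter((s) =>
    Object.entries(filter).every(([key, value]) => s.attributes[key] === value)
  );
}

const sampleSpans: LlmSpan[] = [
  {
    name: "qualify_lead",
    durationMs: 204,
    attributes: {
      "gen_ai.request.model": "gpt-4.1-mini",
      "gen_ai.usage.input_tokens": 812, // illustrative value
      "gen_ai.usage.output_tokens": 96, // illustrative value
      "retry.count": 0,
      "eval.score": 0.91,
      customer_id: "acme-co",
      feature: "qualify",
    },
  },
  {
    name: "summarize_call",
    durationMs: 950,
    attributes: {
      "gen_ai.request.model": "gpt-4.1-mini",
      customer_id: "globex",
      feature: "summarize",
    },
  },
];

const hits = searchSpans(sampleSpans, { customer_id: "acme-co", feature: "qualify" });
console.log(hits.map((s) => s.name)); // ["qualify_lead"]
```

Keeping tags as flat key/value attributes is what lets the same span be filtered by customer, feature, or model without a schema change.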

tracing.ts
import { trace, span } from "@yourbusiness/sdk";

// auto-wraps every supported provider
await trace.install({
  service: "sales-copilot",
  env: process.env.NODE_ENV
});

// custom spans need one line
await span("qualify_lead", async ({ tag }) => {
  tag({ customer_id, feature: "qualify" });
  return await qualify(input);
});

root: qualify_lead 204ms
  ├─ llm: gpt-4.1-mini 132ms · $0.0042
  └─ tool: crm.fetch 62ms · cache HIT

// 02 · live evals

Eval every production response, not just CI.

Run a panel of LLM-as-judge, classifier, and assertion-based evals on every span — or sample at any rate. Alerts trigger when the score for a slice drops below threshold. Evals run async in our cluster, so they don't add latency to your user request.

Out-of-the-box panels for factuality, helpfulness, brand voice, refusal accuracy, and PII leakage. Or write your own and version-control them in your repo.
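The slice alerting described above can be sketched as a small aggregation: group eval scores by slice, average them, and flag any slice whose mean falls below the threshold. This is an illustrative sketch only; `EvalResult`, `slicesBelowThreshold`, and the sample scores are hypothetical and do not describe how Your Business computes alerts.

```typescript
// Hypothetical eval record: one score per evaluated response, keyed by slice
// (for example a customer ID).
interface EvalResult {
  slice: string;
  score: number;
}

// Return the mean score of every slice whose mean is below the threshold.
function slicesBelowThreshold(
  results: EvalResult[],
  threshold: number
): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const { slice, score } of results) {
    const entry = sums.get(slice) ?? { total: 0, count: 0 };
    entry.total += score;
    entry.count += 1;
    sums.set(slice, entry);
  }

  const failing = new Map<string, number>();
  sums.forEach(({ total, count }, slice) => {
    const mean = total / count;
    if (mean < threshold) failing.set(slice, mean);
  });
  return failing;
}

// Made-up scores: acme-co stays healthy, globex drops below 0.85.
const failingSlices = slicesBelowThreshold(
  [
    { slice: "acme-co", score: 0.9 },
    { slice: "acme-co", score: 0.95 },
    { slice: "globex", score: 0.8 },
    { slice: "globex", score: 0.78 },
  ],
  0.85
);
console.log(Array.from(failingSlices.keys())); // ["globex"]
```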

evals.ts
// `eval` cannot be used as a binding name in an ES module, so alias it
import { eval as evaluate } from "@yourbusiness/sdk";

// run a panel on every production response
evaluate("factuality_v3", {
  on: "every_response",
  model: "claude-haiku-4.5",
  alert: { score: "< 0.85", slice: "per_customer" }
});

// custom assert eval, runs in-line; passes when no banned word appears
evaluate.assert("brand_voice", ({ output }) => {
  return !/synergy|leverage/i.test(output);
});

factuality_v3 dropped to 0.79 on `qualify`
slack alert sent · 12s ago

// 03 · cost attribution

Token spend, broken down by anything you tag.

Per-customer, per-feature, per-model, per-retry — slice the bill any way you want. Your Business captures input and output tokens at the span level, applies your provider's price list (kept up to date for you), and rolls it up.

Spend dashboards refresh in real time. Set budget alerts per slice. Export to your finance team's warehouse via Snowflake or BigQuery.
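As a rough illustration of the rollup described above, the sketch below prices each span from its token counts and sums the result per customer and model. The `SpanUsage` and `ModelPrice` types, the model names, and the prices are placeholders invented for the example; real provider price lists differ.

```typescript
// Hypothetical per-span usage record.
interface SpanUsage {
  customerId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// USD per million tokens. The figures below are placeholders, not real prices.
interface ModelPrice {
  inputPerMTok: number;
  outputPerMTok: number;
}

const PLACEHOLDER_PRICES: Record<string, ModelPrice> = {
  "model-a": { inputPerMTok: 2, outputPerMTok: 10 },
  "model-b": { inputPerMTok: 0.5, outputPerMTok: 1 },
};

// Cost of one span in USD.
function spanCost(u: SpanUsage, prices: Record<string, ModelPrice>): number {
  const price = prices[u.model];
  if (!price) throw new Error(`no price listed for model ${u.model}`);
  return (
    (u.inputTokens * price.inputPerMTok + u.outputTokens * price.outputPerMTok) /
    1_000_000
  );
}

// Sum span costs into { customerId: { model: usd } }.
function rollupSpend(
  usages: SpanUsage[],
  prices: Record<string, ModelPrice>
): Record<string, Record<string, number>> {
  const out: Record<string, Record<string, number>> = {};
  for (const u of usages) {
    const byModel = out[u.customerId] ?? {};
    byModel[u.model] = (byModel[u.model] ?? 0) + spanCost(u, prices);
    out[u.customerId] = byModel;
  }
  return out;
}

const spend = rollupSpend(
  [
    { customerId: "acme-co", model: "model-a", inputTokens: 1_000_000, outputTokens: 100_000 },
    { customerId: "acme-co", model: "model-a", inputTokens: 500_000, outputTokens: 0 },
    { customerId: "globex", model: "model-b", inputTokens: 2_000_000, outputTokens: 1_000_000 },
  ],
  PLACEHOLDER_PRICES
);
console.log(spend); // { "acme-co": { "model-a": 4 }, globex: { "model-b": 2 } }
```

Pricing at the span level and aggregating afterwards is what allows the same data to be regrouped by any other tag, such as feature or tier.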

spend.ts
// tag once, slice forever
yourbusiness.tag(span, {
  customer_id: "acme-co",
  feature: "summarize",
  tier: "enterprise"
});

// real-time spend rollup (amounts in USD)
GET /api/spend?group_by=customer_id,model&range=last_24h

{
  "acme-co": { "opus-4.7": 284.10, "haiku-4.5": 3.20 },
  "globex": { "opus-4.7": 612.45 }
}

// 04 · replay

Reproduce any production trace locally, in 30 seconds.

Replay rebuilds the exact tool sandbox, model parameters, and inputs from any span you've recorded — and runs them again locally, against any model. Step through the trace, swap the model, change the prompt, and rerun.

The fastest way we've found to debug an agent regression is to replay the exact failing trace, drop a breakpoint in the tool call, and inspect the inputs the LLM actually saw — not what your test suite thinks it saw.
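One common way to get this kind of reproducibility is record-and-replay stubbing: tool calls captured in the original trace are served back in order instead of hitting live systems. The sketch below shows that general technique under stated assumptions; `RecordedCall` and `makeReplayStub` are hypothetical names, and this is not a description of how the Your Business replay engine is built.

```typescript
// Hypothetical snapshot entry: one tool call and the result it returned.
interface RecordedCall {
  tool: string;
  args: unknown;
  result: unknown;
}

// Build a stub that answers tool calls from the snapshot, in recorded order,
// and fails loudly as soon as the replayed run diverges from the recording.
function makeReplayStub(snapshot: RecordedCall[]) {
  let cursor = 0;
  return function callTool(tool: string, args: unknown): unknown {
    const expected = snapshot[cursor];
    // JSON comparison is key-order sensitive; good enough for a sketch.
    const sameArgs =
      expected !== undefined &&
      JSON.stringify(expected.args) === JSON.stringify(args);
    if (!expected || expected.tool !== tool || !sameArgs) {
      throw new Error(`replay diverged at call ${cursor}: ${tool}`);
    }
    cursor += 1;
    return expected.result;
  };
}

// Made-up snapshot with a single recorded tool call.
const replayTool = makeReplayStub([
  { tool: "crm.fetch", args: { id: 42 }, result: { name: "Acme Co" } },
]);

const firstResult = replayTool("crm.fetch", { id: 42 });
console.log(firstResult); // { name: "Acme Co" }
```

Failing on the first mismatched call is a deliberate choice: it points at the exact step where a new model or prompt started behaving differently from the recorded run.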

terminal · replay
$ yourbusiness replay trc_a1b2c3d4

→ rebuilding tool sandbox…
→ restoring 4 tool stubs from snapshot
→ replaying with claude-opus-4.7

root: draft_followup 3.4s
  ├─ llm: claude-opus-4.7 2.1s · 4,820 tok
  ├─ retry: tool_call_failed → 1 retry · ok
  └─ eval: brand_voice PASS · 0.94

$ yourbusiness replay trc_a1b2c3d4 \
  --model gpt-4.1 --diff

→ side-by-side diff at localhost:7373

// 05 · prompt registry

Version, A/B, and roll back prompts without a deploy.

Treat prompts the way you treat config: separate from code, versioned, diffed, and rollable. Push a new prompt, canary it to 10% of traffic, watch the eval scores, promote or roll back — all from the dashboard or a CLI command.

Every prompt change is an artifact with a hash, a diff, an author, and a deploy event. If something breaks, revert is one click.
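A canary split like the 10% rollout above is often implemented by hashing a stable request key into a bucket, so the same user or conversation always sees the same prompt version. The sketch below assumes that approach; `fnv1a`, `pickVersion`, and the key format are illustrative choices, not the routing logic Your Business actually uses.

```typescript
// 32-bit FNV-1a hash over the string's UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Route a request to the canary when its bucket (0..99) falls under the
// canary percentage; otherwise keep it on the stable version.
function pickVersion(
  requestKey: string,
  stable: string,
  canary: string,
  canaryPercent: number
): string {
  const bucket = fnv1a(requestKey) % 100;
  return bucket < canaryPercent ? canary : stable;
}

// The same key always lands on the same version for a given percentage.
const chosenVersion = pickVersion("customer:acme-co", "v18", "v19", 10);
console.log(chosenVersion === pickVersion("customer:acme-co", "v18", "v19", 10)); // true
```

Because the bucket depends only on the key, raising the percentage from 10 to 50 keeps every request that was already on the canary where it was.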

prompts.ts
// fetch the current stable version
const prompt = await yourbusiness.prompt("qualify_lead@v18", {
  vars: { account, owner }
});

// canary 10% to v19, keep 90% on v18
yourbusiness.deploy("qualify_lead", {
  canary: { version: "v19", traffic: "10%" },
  stable: "v18"
});

v19 · factuality 0.93 (+0.04) · helpful 0.91
promoted to stable · 12s

// at-a-glance

Where each capability lives in your stack

Pick the layers you need. Your Business's modules ship together but compose independently — most teams start with tracing only and turn on evals after the first week.

Capability                   Hobby     Pro       Enterprise
Distributed tracing          included  included  included
Live evals (5 panels)        -         included  included
Custom evals                 -         included  included
Cost attribution dashboard   basic     full      full
Replay & datasets            -         included  included
Prompt registry              -         included  included
Trace retention              3 days    30 days   365 days
SSO / SAML / SCIM            -         -         included
Self-hosted / VPC deploy     -         -         included
SOC 2 · HIPAA · BAA          -         SOC 2     all three

// in production

How a team ships an agent improvement on Your Business

A four-step loop most of our customers run weekly. Each step starts from the platform, even when the work itself happens in your IDE or CI.

// 01

Notice the regression

An eval slice drops below threshold. Slack alert fires. The dashboard surfaces the failing traces, grouped by feature.

// 02

Replay the worst one

Click into a failing trace, hit "replay locally," and step through the tool sandbox in your IDE. Reproduce the bug in 30 seconds.

// 03

Fix in the prompt registry

Edit the prompt, push as `v19`, promote to canary at 10%. The dashboard shows the new version's eval scores in real time.

// 04

Promote and freeze a test

Once `v19` beats `v18` on every panel for 200 traces, promote to stable. Convert the original failing trace into a regression test in CI.
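The promotion rule in this step can be written down as a small gate. The sketch below reads "for 200 traces" as "after at least 200 canary traces", which is an assumption; `VersionStats` and `canPromote` are hypothetical names, and the `v18` helpfulness score is a placeholder since the page only reports it for `v19`.

```typescript
// Mean eval score per panel for one prompt version, plus its trace count.
interface VersionStats {
  traces: number;
  panels: Record<string, number>;
}

// Promote only when the candidate has enough traces and scores strictly
// higher than stable on every panel that stable reports.
function canPromote(
  candidate: VersionStats,
  stable: VersionStats,
  minTraces = 200
): boolean {
  if (candidate.traces < minTraces) return false;
  return Object.entries(stable.panels).every(([panel, stableScore]) => {
    const candidateScore = candidate.panels[panel];
    return candidateScore !== undefined && candidateScore > stableScore;
  });
}

const v18: VersionStats = {
  traces: 5000, // placeholder
  panels: { factuality: 0.89, helpful: 0.9 }, // helpful score is a placeholder
};
const v19: VersionStats = {
  traces: 200,
  panels: { factuality: 0.93, helpful: 0.91 },
};

console.log(canPromote(v19, v18)); // true
```

Requiring a win on every panel, rather than on an average, keeps a large gain in one panel from hiding a regression in another.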

How Your Business is different

A short list of design decisions that distinguish us from logging tools, eval platforms, and the LLM-observability category at large.

OTel from day one

Every span is OpenTelemetry-shaped. Pipe to Datadog, Honeycomb, or Grafana Tempo with no re-instrumentation. We're a complement to your existing stack, not a fork.

Evals on every response

Most platforms eval in CI only. We run them on production traffic, async, in our cluster, with no added user latency. Bugs surface in production where they actually hurt.

Per-span pricing, not per-seat

Add the whole company. We charge by ingest volume, the way logs are priced. No "but only three engineers can see the dashboard" trap.

Replay, not just trace

You can read a trace anywhere. You can replay one only here. Reproducing the bug is most of the debugging — we make that the cheapest step.

Built by ex-observability folks

The team is from Datadog, Honeycomb, and Sentry. We've shipped these primitives before, just for HTTP and crashes. Now we're doing it for LLM call graphs.

SDK-first, dashboard-second

We don't make you click through ten dashboards to get value. The SDK does most of the work. The UI is for the humans who didn't write the code.

See it on your own traffic.

Free up to 50,000 spans a month. No credit card. Drop in three lines of SDK and you'll see your first trace in the dashboard in under a minute.

Start free →
$ npm i @yourbusiness/sdk