// product

Every layer of your agent. Observed, by default.

Your Business instruments the LLM call graph the way Datadog instruments the HTTP one. Drop in three lines of SDK code and you get distributed traces, live evals, prompt versioning, replay, and cost attribution across every model, tool, and retry.

// 01 · distributed tracing

OTel-native traces for every span your agent emits.

Your Business tracing is built on the OpenTelemetry semantic conventions, extended with LLM-specific span attributes (token counts, retries, eval scores). One auto-instrumentation call wraps every supported SDK; manual spans take a one-line wrapper.

Search by anything you tag — customer ID, feature, model version, eval score, environment. Live trace tail shows production traffic in under 800 milliseconds, end-to-end.

Automatic instrumentation for OpenAI, Anthropic, Gemini, Bedrock, Azure
Native spans for tool calls, retries, and streaming chunks
OTLP-compatible — export to Datadog, Honeycomb, or Grafana Tempo
End-to-end client-to-server propagation across services
p99 ingest 3.6s · 14B+ spans / month at steady state

tracing.ts

import { trace, span } from "@yourbusiness/sdk";

// auto-wraps every supported provider
await trace.install({
  service: "sales-copilot",
  env: process.env.NODE_ENV
});

// custom spans need one line
await span("qualify_lead", async ({ tag }) => {
  tag({ customer_id, feature: "qualify" });
  return await qualify(input);
});

● root: qualify_lead     204ms
  ├─ llm: gpt-4.1-mini    132ms · $0.0042
  └─ tool: crm.fetch       62ms · cache HIT
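
Cross-service propagation in OpenTelemetry setups usually rides on the W3C Trace Context `traceparent` header. The sketch below is not the Your Business SDK; it only illustrates the header format a downstream service would parse and forward. The helper names (`formatTraceparent`, `parseTraceparent`, `childContext`) are hypothetical, and only header version `00` is handled.

```typescript
// Illustrative only: helper names are hypothetical, not part of any SDK.
// Header layout per W3C Trace Context: version-traceId-parentId-flags (lowercase hex).
interface TraceContext {
  traceId: string;  // 32 hex chars, shared by every span in the trace
  spanId: string;   // 16 hex chars, identifies the calling span
  sampled: boolean; // low bit of the trace-flags byte
}

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT.exec(header.trim());
  if (!m) return null;
  const [, traceId, spanId, flags] = m;
  // All-zero trace or span IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// A downstream service keeps the trace ID and substitutes its own span ID.
function childContext(parent: TraceContext, newSpanId: string): TraceContext {
  return { ...parent, spanId: newSpanId };
}

const incoming = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
);
const outgoing = incoming
  ? formatTraceparent(childContext(incoming, "b7ad6b7169203331"))
  : null;
console.log(outgoing);
// prints "00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01"
```

Because the trace ID survives every hop, spans emitted by different services can be stitched back into one tree.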

// 02 · live evals

Eval every production response, not just CI.

Run a panel of LLM-as-judge, classifier, and assertion-based evals on every span — or sample at any rate. Alerts trigger when the score for a slice drops below threshold. Evals run async in our cluster, so they don't add latency to your user request.

Out-of-the-box panels for factuality, helpfulness, brand voice, refusal accuracy, and PII leakage. Or write your own and version-control them in your repo.

10+ ready panels, each tested across 50,000 production traces
Custom evals as code — TypeScript, Python, or YAML
Scored at the span level — drill into the failing tool call
Alerts via PagerDuty, Slack, or webhook within 30 seconds of a regression

evals.ts

// "eval" cannot be a binding name in strict-mode modules, so alias it on import
import { eval as evaluate } from "@yourbusiness/sdk";

// run a panel on every production response
evaluate("factuality_v3", {
  on: "every_response",
  model: "claude-haiku-4.5",
  alert: { score: "< 0.85", slice: "per_customer" }
});

// custom assert eval, runs in-line
evaluate.assert("brand_voice", ({ output }) => {
  // pass only when the output avoids the banned words
  return !/synergy|leverage/i.test(output);
});

⚠ factuality_v3 dropped to 0.79 on `qualify`
→ slack alert sent · 12s ago
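
To make the "alert when a slice drops below threshold" idea concrete, here is a minimal sketch of the aggregation step: group scored spans by slice, take the mean, and flag slices under the threshold. The types and the `findRegressedSlices` function are assumptions for illustration, not the platform's actual implementation.

```typescript
// Illustrative sketch: types and function names are assumptions, not the SDK's API.
interface ScoredSpan {
  slice: string; // e.g. a customer ID or a feature name
  score: number; // eval score in [0, 1]
}

interface SliceAlert {
  slice: string;
  mean: number;
  count: number;
}

function findRegressedSlices(
  spans: ScoredSpan[],
  threshold: number,
  minCount = 1 // ignore slices with too few scored spans
): SliceAlert[] {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const { slice, score } of spans) {
    const t = totals[slice] ?? (totals[slice] = { sum: 0, count: 0 });
    t.sum += score;
    t.count += 1;
  }
  const result: SliceAlert[] = [];
  for (const slice of Object.keys(totals)) {
    const { sum, count } = totals[slice];
    const mean = sum / count;
    if (count >= minCount && mean < threshold) {
      result.push({ slice, mean, count });
    }
  }
  return result;
}

const regressed = findRegressedSlices(
  [
    { slice: "qualify", score: 0.75 },
    { slice: "qualify", score: 0.83 },
    { slice: "summarize", score: 0.9 },
    { slice: "summarize", score: 0.94 },
  ],
  0.85
);
console.log(regressed.map((a) => a.slice)); // only "qualify" falls below 0.85
```

A `minCount` guard like this keeps a single low-scoring span in a quiet slice from paging anyone.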

// 03 · cost attribution

Token spend, broken down by anything you tag.

Per-customer, per-feature, per-model, per-retry — slice the bill any way you want. Your Business captures input and output tokens at the span level, applies your provider's price list (kept up to date for you), and rolls it up.

Spend dashboards refresh in real time. Set budget alerts per slice. Export to your finance team's warehouse in Snowflake or BigQuery.

Real-time spend by customer, model, environment, feature, retry
Provider price lists auto-updated weekly
Budget alerts at 50/80/100% of monthly target per slice
BigQuery and Snowflake exports, hourly
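
The 50/80/100% budget alerts can be pictured as a threshold-crossing check that fires once per threshold as spend grows. This is a hedged sketch; `crossedThresholds` and its parameters are hypothetical names, not part of the product's API.

```typescript
// Illustrative sketch; names are hypothetical.
const THRESHOLDS = [50, 80, 100]; // percent of the monthly target

// Returns the thresholds newly crossed when spend moves from `before` to `after`.
function crossedThresholds(
  before: number,
  after: number,
  monthlyTarget: number
): number[] {
  if (monthlyTarget <= 0) throw new Error("monthlyTarget must be positive");
  return THRESHOLDS.filter((pct) => {
    const limit = (monthlyTarget * pct) / 100;
    return before < limit && after >= limit;
  });
}

console.log(crossedThresholds(400, 850, 1000)); // prints [ 50, 80 ]
```

Comparing the previous and current totals, rather than only the current one, avoids re-sending the same alert on every refresh.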

spend.ts

// tag once, slice forever
yourbusiness.tag(span, {
  customer_id: "acme-co",
  feature: "summarize",
  tier: "enterprise"
});

// real-time spend rollup
GET /api/spend?group_by=customer_id,model&range=last_24h

{
  "acme-co": { "opus-4.7": $284.10, "haiku-4.5": $3.20 },
  "globex": { "opus-4.7": $612.45 }
}
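
The rollup itself is simple arithmetic: tokens times a per-model price, summed by whatever tags you group on. The sketch below assumes per-million-token pricing and uses placeholder model names and made-up prices; none of the names or numbers come from the product.

```typescript
// Illustrative sketch: model names and prices are made-up placeholders.
interface SpanUsage {
  customerId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// USD per million tokens (hypothetical numbers for the example only).
interface Price {
  inputPerM: number;
  outputPerM: number;
}

type Rollup = Record<string, Record<string, number>>;

function rollupSpend(spans: SpanUsage[], prices: Record<string, Price>): Rollup {
  const out: Rollup = {};
  for (const s of spans) {
    const p = prices[s.model];
    if (!p) throw new Error(`no price for model ${s.model}`);
    const cost =
      (s.inputTokens * p.inputPerM + s.outputTokens * p.outputPerM) / 1e6;
    const byModel = out[s.customerId] ?? (out[s.customerId] = {});
    byModel[s.model] = (byModel[s.model] ?? 0) + cost;
  }
  return out;
}

const examplePrices: Record<string, Price> = {
  "model-a": { inputPerM: 2, outputPerM: 8 },
  "model-b": { inputPerM: 1, outputPerM: 4 },
};

const spend = rollupSpend(
  [
    { customerId: "acme-co", model: "model-a", inputTokens: 1000000, outputTokens: 500000 },
    { customerId: "acme-co", model: "model-a", inputTokens: 500000, outputTokens: 0 },
    { customerId: "globex", model: "model-b", inputTokens: 2000000, outputTokens: 1000000 },
  ],
  examplePrices
);
console.log(spend); // acme-co/model-a totals 7, globex/model-b totals 6
```

Grouping by a different tag (feature, environment, retry) only changes which field becomes the outer key.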

// 04 · replay

Reproduce any production trace locally, in 30 seconds.

Replay rebuilds the exact tool sandbox, model parameters, and inputs from any span you've recorded — and runs them again locally, against any model. Step through the trace, swap the model, change the prompt, and rerun.

The fastest way we've found to debug an agent regression is to replay the exact failing trace, drop a breakpoint in the tool call, and inspect the inputs the LLM actually saw — not what your test suite thinks it saw.

Tool sandbox restored from span attributes — no live HTTP calls
Swap model, prompt, or temperature on replay to compare outputs
Side-by-side diff view of original vs. replayed responses
Generate a CI test from any replayed trace in one click

terminal · replay

$ yourbusiness replay trc_a1b2c3d4
→ rebuilding tool sandbox…
→ restoring 4 tool stubs from snapshot
→ replaying with claude-opus-4.7

● root: draft_followup       3.4s
  ├─ llm: claude-opus-4.7    2.1s · 4,820 tok
  ├─ retry: tool_call_failed → 1 retry · ok
  └─ eval: brand_voice        PASS · 0.94

$ yourbusiness replay trc_a1b2c3d4 \
  --model gpt-4.1 --diff
→ side-by-side diff at localhost:7373
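
One way to picture "tool sandbox restored from span attributes" is a set of stubs that answer from the recording instead of the network. This is a sketch under assumptions: the `RecordedToolCall` shape and `buildStubs` are hypothetical, and real recordings would need a more robust key than `JSON.stringify` (which is sensitive to property order).

```typescript
// Illustrative sketch: the recording shape and function names are assumptions.
interface RecordedToolCall {
  tool: string;
  args: unknown;
  result: unknown;
}

type ToolFn = (args: unknown) => unknown;

// Build stubs that replay recorded results, so no live HTTP call is made.
function buildStubs(recorded: RecordedToolCall[]): Record<string, ToolFn> {
  const byKey: Record<string, unknown> = {};
  const stubs: Record<string, ToolFn> = {};
  for (const call of recorded) {
    const tool = call.tool;
    byKey[tool + ":" + JSON.stringify(call.args)] = call.result;
    stubs[tool] = (args) => {
      const key = tool + ":" + JSON.stringify(args);
      if (!(key in byKey)) {
        // The replayed run diverged from the recording: fail loudly.
        throw new Error(`no recorded result for ${key}`);
      }
      return byKey[key];
    };
  }
  return stubs;
}

const stubs = buildStubs([
  {
    tool: "crm.fetch",
    args: { id: "acct_1" },
    result: { name: "Acme", tier: "enterprise" },
  },
]);
const account = stubs["crm.fetch"]({ id: "acct_1" });
console.log(account); // the recorded result, with no network access
```

Throwing on an unrecorded call is a deliberate choice here: if a swapped model asks for different tool inputs, that divergence is itself the finding.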

// 05 · prompt registry

Version, A/B, and roll back prompts without a deploy.

Treat prompts the way you treat config: separate from code, versioned, diffed, and rollable. Push a new prompt, canary it to 10% of traffic, watch the eval scores, promote or roll back — all from the dashboard or a CLI command.

Every prompt change is an artifact with a hash, a diff, an author, and a deploy event. If something breaks, revert is one click.

Prompts versioned with semver-style tags · canary by traffic %
Diff view shows token-level changes between versions
Eval scores per version surfaced in the picker — pick by quality
One-click rollback with audit log entry

prompts.ts

// fetch a pinned version (v18 is the current stable)
const prompt = await yourbusiness.prompt("qualify_lead@v18", {
  vars: { account, owner }
});

// canary 10% to v19, keep 90% on v18
yourbusiness.deploy("qualify_lead", {
  canary: { version: "v19", traffic: "10%" },
  stable: "v18"
});

● v19 · factuality 0.93 (+0.04) · helpful 0.91
→ promoted to stable · 12s
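
A common way to implement a percentage canary is deterministic bucketing: hash a stable key (a user or session ID) into one of 100 buckets and send the lowest buckets to the canary. The sketch below shows that technique with a simple polynomial string hash; it is an assumption about how such routing could work, not a description of the registry's internals, and `bucketOf` and `pickVersion` are hypothetical names.

```typescript
// Illustrative sketch: function names and the hashing choice are assumptions.
function bucketOf(key: string): number {
  // Simple polynomial string hash, kept in the unsigned 32-bit range.
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0;
  }
  return h % 100; // bucket in [0, 99]
}

function pickVersion(
  key: string,
  stable: string,
  canary: string,
  canaryPercent: number
): string {
  return bucketOf(key) < canaryPercent ? canary : stable;
}

console.log(pickVersion("ab", "v18", "v19", 10)); // bucket 5, prints "v19"
console.log(pickVersion("a", "v18", "v19", 10));  // bucket 97, prints "v18"
```

Because the bucket depends only on the key, a given user keeps seeing the same prompt version for the whole canary, which keeps per-version eval scores comparable.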

// at-a-glance

Where each capability lives in your stack

Pick the layers you need. Your Business's modules ship together but compose independently — most teams start with tracing only and turn on evals after the first week.

| Capability | Hobby | Pro | Enterprise |
| --- | --- | --- | --- |
| Distributed tracing | included | included | included |
| Live evals (5 panels) | — | included | included |
| Custom evals | — | included | included |
| Cost attribution dashboard | basic | full | full |
| Replay & datasets | — | included | included |
| Prompt registry | — | included | included |
| Trace retention | 3 days | 30 days | 365 days |
| SSO / SAML / SCIM | — | — | included |
| Self-hosted / VPC deploy | — | — | included |
| SOC 2 · HIPAA · BAA | — | SOC 2 | all three |

// in production

How a team ships an agent improvement on Your Business

A four-step loop most of our customers run weekly. Each step lives in one place and never leaves the platform.

// 01

Notice the regression

An eval slice drops below threshold. Slack alert fires. The dashboard surfaces the failing traces, grouped by feature.

// 02

Replay the worst one

Click into a failing trace, hit "replay locally," and step through the tool sandbox in your IDE. Reproduce the bug in 30 seconds.

// 03

Fix in the prompt registry

Edit the prompt, push as `v19`, promote to canary at 10%. The dashboard shows the new version's eval scores in real time.

// 04

Promote and freeze a test

Once `v19` beats `v18` on every panel for 200 traces, promote to stable. Convert the original failing trace into a regression test in CI.
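
The promotion rule in this step can be written down as a small gate. This sketch reads "beats on every panel for 200 traces" as: at least 200 canary traces scored, and a strictly higher mean on every panel the stable version has. That reading, and the `shouldPromote` name, are assumptions.

```typescript
// Illustrative sketch: the gate's name and exact semantics are assumptions.
type PanelScores = Record<string, number>; // mean eval score per panel

function shouldPromote(
  canary: PanelScores,
  stable: PanelScores,
  canaryTraceCount: number,
  minTraces = 200
): boolean {
  if (canaryTraceCount < minTraces) return false; // not enough evidence yet
  const panels = Object.keys(stable);
  return (
    panels.length > 0 &&
    panels.every((p) => p in canary && canary[p] > stable[p])
  );
}

const v18 = { factuality: 0.89, helpfulness: 0.9 };
const v19 = { factuality: 0.93, helpfulness: 0.91 };

console.log(shouldPromote(v19, v18, 200)); // prints true
console.log(shouldPromote(v19, v18, 199)); // prints false
```

A tie on any panel blocks promotion under this reading, which errs on the side of keeping the known-good version.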

How Your Business is different

A short list of design decisions that distinguish us from logging tools, eval platforms, and the LLM-observability category at large.

OTel from day one

Every span is OpenTelemetry-shaped. Pipe to Datadog, Honeycomb, or Grafana Tempo with no re-instrumentation. We're a complement to your existing stack, not a fork.

Evals on every response

Most platforms eval in CI only. We run them on production traffic, async, in our cluster, with no added user latency. Bugs surface in production where they actually hurt.

Per-span pricing, not per-seat

Add the whole company. We charge by ingest volume, the way logs are priced. No "but only three engineers can see the dashboard" trap.

Replay, not just trace

You can read a trace anywhere. You can replay one only here. Reproducing the bug is most of the debugging — we make that the cheapest step.

Built by ex-observability folks

The team is from Datadog, Honeycomb, and Sentry. We've shipped these primitives before, just for HTTP and crashes. Now we're doing it for LLM call graphs.

SDK-first, dashboard-second

We don't make you click through ten dashboards to get value. The SDK does most of the work. The UI is for the humans who didn't write the code.