Offline evals vs. real stored-state gates

Offline evaluations measure model quality on held-out prompts. They do not prove that your agent run created the ticket, ledger entry, or entitlement record your operators expect. AgentSkeptic read-only verification compares structured tool activity to your authoritative stores at decision time, so gaps surface before customers find them.

Use /integrate to wire structured NDJSON observations into your environment, then use /pricing when you need commercial metering for API-backed verification runs in CI.
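To make the idea of a stored-state gate concrete, here is a minimal sketch of comparing NDJSON observations against a store using read-only queries. It is not AgentSkeptic's implementation or schema: the field names (`tool`, `table`, `key`), the table layout, and the helpers `open_store` and `find_missing` are all hypothetical, and an in-memory SQLite database stands in for an authoritative store.

```python
import json
import sqlite3

# Hypothetical NDJSON observations: one JSON object per line, each describing
# a write the agent claims to have made. Field names are assumptions.
OBSERVATIONS = """\
{"tool": "create_ticket", "table": "tickets", "key": "T-1"}
{"tool": "grant_entitlement", "table": "entitlements", "key": "E-9"}
"""

# Table names are interpolated into SQL below, so restrict them to a known set.
ALLOWED_TABLES = {"tickets", "entitlements"}


def open_store() -> sqlite3.Connection:
    """Build an in-memory stand-in for an authoritative store (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tickets (id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE entitlements (id TEXT PRIMARY KEY)")
    # Only the ticket was actually written; the entitlement is missing.
    conn.execute("INSERT INTO tickets (id) VALUES ('T-1')")
    conn.commit()
    return conn


def find_missing(ndjson: str, conn: sqlite3.Connection) -> list[dict]:
    """Return observations whose claimed record is absent from the store.

    Issues SELECT statements only, so the check itself never mutates state.
    """
    missing = []
    for line in ndjson.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        obs = json.loads(line)
        table = obs["table"]
        if table not in ALLOWED_TABLES:
            raise ValueError(f"unexpected table: {table!r}")
        row = conn.execute(
            f"SELECT 1 FROM {table} WHERE id = ?", (obs["key"],)
        ).fetchone()
        if row is None:
            missing.append(obs)
    return missing


if __name__ == "__main__":
    store = open_store()
    for obs in find_missing(OBSERVATIONS, store):
        print(f"gap: {obs['tool']} claimed {obs['table']}/{obs['key']}, not found")
```

In this sketch the agent reported two writes but only the ticket row exists, so the entitlement observation is reported as a gap. An eval score alone would not reveal that mismatch, because it never consults the store.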

What to do next

  • Start with a first run on /integrate before you expand eval coverage.
  • Compare bundled proof at /examples/wf-missing.
  • Read /pricing for commercial packaging when eval infrastructure needs API keys.
  • Review /security for how verification credentials are scoped.