Offline evals vs. real stored-state gates
Offline evaluations measure model quality on held-out prompts. They do not prove that your agent run created the ticket, ledger entry, or entitlement record your operators expect. AgentSkeptic's read-only verification compares structured tool activity against your authoritative stores at decision time, so gaps surface before your customers find them.
Use /integrate to wire structured NDJSON observations into your environment, then use /pricing when you need commercial metering for API-backed verification runs in CI.
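As a rough illustration of the comparison described above, the Python sketch below parses NDJSON tool observations (one JSON object per line) and checks each claimed record against an in-memory stand-in for an authoritative store, reporting any record the agent claims to have written that the store does not hold. The field names (`tool`, `store`, `record_id`), the `Gap` type, the helper functions, and the dict-of-sets store are all hypothetical; this is not AgentSkeptic's observation schema or API.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    """A claimed write that the authoritative store does not reflect (hypothetical type)."""
    tool: str
    store: str
    record_id: str


def parse_observations(ndjson_text: str) -> list[dict]:
    """Parse NDJSON: each non-blank line is one JSON object.

    The observation shape {"tool", "store", "record_id"} is an assumption
    made for illustration only.
    """
    return [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]


def find_gaps(observations: list[dict], stores: dict[str, set[str]]) -> list[Gap]:
    """Return observations whose claimed record is missing from its store.

    `stores` maps a hypothetical store name to the set of record IDs it
    actually holds. The function only reads from it, mirroring a
    read-only verification step.
    """
    gaps = []
    for obs in observations:
        held = stores.get(obs["store"], set())
        if obs["record_id"] not in held:
            gaps.append(Gap(obs["tool"], obs["store"], obs["record_id"]))
    return gaps


if __name__ == "__main__":
    log = (
        '{"tool": "create_ticket", "store": "tickets", "record_id": "T-101"}\n'
        '{"tool": "post_ledger", "store": "ledger", "record_id": "L-7"}\n'
    )
    # The ticket exists, but the ledger entry the agent reported never landed.
    stores = {"tickets": {"T-101"}, "ledger": set()}
    for gap in find_gaps(parse_observations(log), stores):
        print(gap)  # prints: Gap(tool='post_ledger', store='ledger', record_id='L-7')
```

In a real gate the `stores` lookup would be a read-only query against your database or system of record rather than an in-memory set; keeping it as a plain mapping here just makes the comparison logic easy to see.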
What to do next
- Start a first run on /integrate before you expand eval coverage.
- Compare bundled proof at /examples/wf-missing.
- Read /pricing for commercial packaging when eval infrastructure needs API keys.
- Review /security for how verification credentials are scoped.