12 questions to ask any AI agent vendor before you sign
The 12-question checklist that separates AI agent vendors who have shipped from vendors who have demoed.
A strong demo proves almost nothing. The real signal is whether the vendor can answer operational questions without hand-waving.
You sat through the demo. It looked great. Six months in, you cannot get a straight answer on what the agent is actually allowed to do, how it fails, or how you roll back a bad run. Two public reference points show why that gap matters: Air Canada was held liable for a chatbot that confidently invented a bereavement policy, while OpenAI's own Operator launch describes a system that intentionally pauses and hands control back to the user for sensitive actions. One team shipped a confident system without operating boundaries. The other shipped operating boundaries and called them a feature. This post is the 12-question checklist that tells you which one you are buying.
By the end of this post, you will know how to interrogate a vendor on:
- Permissions, approval gates, and tool boundaries
- Grounding, uncertainty handling, and live-state verification
- Failure modes, rollback paths, and audit trails
- Tenant scope, operator observability, and how the vendor thinks about trust
1. What exactly is the agent allowed to do?
Ask for the action taxonomy in writing: read-only, draft-only, approved-write, and fully blocked. If the vendor cannot hand you that table for your deployment, the permission model is a prompt, not a system.
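To make "a system, not a prompt" concrete, here is a minimal sketch of what that table can look like when it lives in code. The action names (`crm.read_contact` and so on) and the `may_run` helper are hypothetical and only illustrate the four tiers named above; a real vendor's schema will differ.

```python
from enum import Enum


class Access(Enum):
    READ_ONLY = "read-only"
    DRAFT_ONLY = "draft-only"
    APPROVED_WRITE = "approved-write"
    BLOCKED = "fully blocked"


# Hypothetical action names, for illustration only.
ACTION_TAXONOMY = {
    "crm.read_contact": Access.READ_ONLY,
    "email.draft_reply": Access.DRAFT_ONLY,
    "crm.update_contact": Access.APPROVED_WRITE,
    "billing.issue_refund": Access.BLOCKED,
}


def may_run(action: str, human_approved: bool = False) -> bool:
    """Decide outside the model whether an action may execute."""
    # Anything missing from the table is treated as blocked.
    access = ACTION_TAXONOMY.get(action, Access.BLOCKED)
    if access is Access.BLOCKED:
        return False
    if access is Access.APPROVED_WRITE:
        return human_approved
    return True
```

The useful property is the default: an action nobody classified cannot run. If the vendor's answer depends on the model remembering an instruction, there is no equivalent of that last-resort `BLOCKED`.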
2. How are risky actions gated?
"The prompt tells the model to ask first" is not a gate. Ask which actions are blocked at the system level until a human approves, and which actions can cause external side effects without review.
3. How does the system verify live-state claims?
When the agent says "there are no new replies" or "this record was updated," ask what it read to know that. Systems that ground every live assertion in a fresh read are very different from systems that summarise stale context and hope.
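One way to picture "grounding every live assertion in a fresh read" is a freshness check in front of any claim about current state. This is a sketch under assumptions: the `Observation` type, the `grounded_value` helper, and the 30-second threshold are all invented for illustration.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Observation:
    value: Any
    read_at: float  # seconds since the epoch when the value was read


MAX_AGE_SECONDS = 30.0  # assumed freshness window, not a recommendation


def grounded_value(
    cached: Optional[Observation],
    fetch: Callable[[], Any],
    now: Callable[[], float] = time.time,
) -> Observation:
    """Return an observation fresh enough to assert, re-reading if stale."""
    if cached is None or now() - cached.read_at > MAX_AGE_SECONDS:
        return Observation(value=fetch(), read_at=now())
    return cached
```

A vendor with something like this can show you the read timestamp behind "there are no new replies." A vendor without it can only show you the sentence.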
4. What happens when the agent is unsure?
Every serious system has a defined uncertainty posture: escalate, preview, refuse, or guess. You want a vendor who can tell you which one fires in which situation — and show it firing in a trace.
5. How are tools isolated and controlled?
Tools are the real operating surface. Ask what tools exist, how narrow each one is, whether inputs are validated, and whether two conflicting tools can run on the same turn. The safest agents have the fewest tools with the clearest boundaries.
6. What does failure handling look like?
Most buyers ask what the system can do; few ask how it fails. Ask which errors retry, which errors halt execution, whether there is loop detection and a circuit breaker, and what the user sees when something goes wrong.
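The retry, halt, and circuit-breaker distinction can be sketched in a few lines. Everything here is hypothetical (the error classes, the `CircuitBreaker` name, the thresholds), and a production version would also add backoff between retries.

```python
class TransientError(Exception):
    """An error worth retrying, such as a timeout."""


class FatalError(Exception):
    """An error that should halt the run immediately."""


class CircuitOpen(Exception):
    """Raised when too many consecutive calls have failed."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.failures = 0  # consecutive failed calls

    def call(self, fn, retries: int = 2):
        if self.failures >= self.max_failures:
            raise CircuitOpen("too many consecutive failures; halting")
        for _ in range(retries + 1):
            try:
                result = fn()
            except TransientError:
                continue  # retry transient errors
            except FatalError:
                self.failures += 1
                raise  # never retry fatal errors
            else:
                self.failures = 0
                return result
        self.failures += 1
        raise TransientError("retries exhausted")
```

When you ask the vendor which errors retry and which halt, you are asking which of their errors fall into each of these branches, and what number plays the role of `max_failures`.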
7. How is context managed over long workflows?
Enterprise workflows run across days, threads, and stakeholders. Ask what context is preserved across turns, what is compressed, and how the system avoids acting on stale state. If the answer is vague, expect reliability to decay as conversations get longer.
8. Can you audit what happened?
After an incident, can you reconstruct the run — request, tools called, order, what changed, what the user approved, what evidence supported the output? If the vendor cannot produce that timeline on demand, incident response will be guesswork.
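As a rough illustration of what "reconstruct the run" implies, here is a sketch of an append-only run record that renders an ordered timeline. The `RunRecord` and `AuditEvent` names and the event kinds are assumptions for this example, not any vendor's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AuditEvent:
    step: int
    kind: str    # e.g. "request", "tool_call", "approval", "change"
    detail: str


@dataclass
class RunRecord:
    run_id: str
    events: List[AuditEvent] = field(default_factory=list)

    def log(self, kind: str, detail: str) -> None:
        # Steps are numbered in the order they are recorded.
        self.events.append(AuditEvent(len(self.events) + 1, kind, detail))

    def timeline(self) -> str:
        return "\n".join(f"{e.step}. [{e.kind}] {e.detail}" for e in self.events)
```

The point of the sketch is the shape: every request, tool call, approval, and change lands in one ordered record per run, so the timeline is a query rather than a forensic project.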
9. What rollback paths exist?
Buy reversibility, not just success paths. Ask whether changes can be undone, workflows halted mid-run, bad runs isolated, and partial executions recovered. The more consequential the workflow, the more this question matters.
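Reversibility usually means each change is recorded alongside a way to undo it. The sketch below assumes a hypothetical `UndoLog` that pairs every action with its inverse and unwinds them in reverse order; real systems often cannot undo external side effects such as sent emails, which is exactly why the question is worth asking.

```python
from typing import Callable, List


class UndoLog:
    """Pairs each applied change with a step that reverses it."""

    def __init__(self) -> None:
        self._undo_steps: List[Callable[[], object]] = []

    def apply(self, do: Callable[[], object], undo: Callable[[], object]) -> None:
        do()
        # Only record the undo step once the change has been applied.
        self._undo_steps.append(undo)

    def rollback(self) -> None:
        # Undo in reverse order, most recent change first.
        while self._undo_steps:
            self._undo_steps.pop()()
```

For example, a run that closes a record and then reassigns it would register two undo steps, and `rollback()` would remove the reassignment before reopening the record.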
10. How is user and tenant scope enforced?
Ask how user identity and tenant scope are passed into every tool call, and how the agent is prevented from crossing boundaries when context bleeds across sessions. This is where multi-tenant agents quietly leak.
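A minimal sketch of scope enforcement at the tool boundary, assuming a hypothetical in-memory `RECORDS` store and a `Scope` object that the platform, not the model, attaches to every call:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    tenant_id: str
    user_id: str


class ScopeViolation(Exception):
    """Raised when a tool call reaches outside the caller's tenant."""


# Hypothetical data store, for illustration only.
RECORDS = {
    "r1": {"tenant_id": "acme", "body": "Acme renewal notes"},
    "r2": {"tenant_id": "globex", "body": "Globex pricing"},
}


def read_record(scope: Scope, record_id: str) -> dict:
    record = RECORDS[record_id]
    # The check runs inside the tool, whatever the model's context contains.
    if record["tenant_id"] != scope.tenant_id:
        raise ScopeViolation(f"{record_id} is outside tenant {scope.tenant_id}")
    return record
```

The design choice to look for is that the scope comes from the authenticated session and is checked in the tool. If the tenant identifier is something the model supplies as an argument, context bleed becomes a data leak.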
11. What is observable by operators, not just engineers?
Support and operations teams need human-readable run summaries, not log files. Ask whether a non-engineer can inspect a run, see why the agent paused, and explain it to a customer without paging the platform team.
12. How does the vendor think about trust?
Ask directly: "What are the moments where you intentionally slow the system down to protect user trust?" Vendors who have shipped real agents answer this in plain language — approval gates, grounding rules, escalation paths. Vendors who have shipped demos pivot back to model quality.
What you will hear from vendors who fail this list
Three predictions for the conversations you are about to have:
- On question 9, they will pivot to model quality. When you ask about rollback, you will get a paragraph about how the model rarely makes mistakes. That is the tell. Rollback is a system property, not a model property — a vendor who answers the wrong question has not built the system.
- On question 8, the audit trail will be "the logs." Raw model logs are not an audit trail. If the vendor cannot show a single screen that reconstructs a run for a non-engineer, your incident response will run on screen-shares and guesswork.
- On question 12, the strongest vendors will name a moment they made the product slower on purpose. Operator pausing for sensitive actions is the public version of this. If the vendor cannot name their version, they have not yet had the conversation with their own users about where trust breaks.
The real lesson
An enterprise AI agent is an operational system that can change data, trigger workflows, and shape customer trust. Evaluate it like one. The 12 questions above will not just protect you from bad vendors — they will help you spot the rare system that is actually ready for production work.
If you are about to sign with an AI agent vendor, send me the deck or one-pager from your top candidate and I will tell you which three of these 12 questions they cannot answer yet. [email protected].