What is the single most important criterion when evaluating conversational AI?

Grounded resolution rate on your own data, scored by your own auditor. Every other criterion is a proxy. If a vendor refuses to be measured on your tickets, that refusal is the answer.

How long should a fair evaluation take?

Seven to ten weeks if the buyer runs it. Two weeks to write the rubric, four to six for a pilot on real data, one to two for commercial close. The evaluations that finish in two weeks are the ones the vendor ran.

RFP or working session?

Both, in that order. The RFP filters out vendors who cannot write down their answers. The working session reveals whether the product behaves the way the writing claims. Buyers who skip the second step regret it inside a quarter.

Evaluate Conversational AI Platforms: A CRO's View

A scene from last quarter

A VP of CX at a European retailer sat across the table from me with three vendor decks fanned out in front of her. She tapped the middle one and said, "This is the demo my team loved. I just have no idea if it would survive a Tuesday morning in our contact centre." That is the right question, asked late. By the time you have three decks on the table, the demo has already done its work. The job of an evaluation is to disprove the demo, not to admire it.

I run sales at Certainly. I sit in three or four of these evaluations a week, on both sides of the table. The buyers who win share one habit. They take the demo away from the vendor. Everything below is how they do it.

What the demo is actually for

A vendor demo is a sales asset. That is not a complaint, it is a description. Its purpose is to compress a year of capability into thirty minutes of confidence. Every vendor on your shortlist, including mine, has a demo that looks great. If you choose between vendors by demo quality, you are choosing a marketing team.

The buyers who lose this evaluation outsource the success criteria to the vendor who demos first. The first deck sets the standard, and every other vendor is scored against a script they did not write. Halfway through the second meeting the buyer says, "Can you show us the part where it handles the refund the way the other one did?" That is the sound of an evaluation already lost.

The fix is unglamorous. Write your own success criteria before the first demo. Bring them to every meeting. Score every vendor against your criteria, never theirs.

The five questions that decide every deal

Strip a year of enterprise evaluations down to the questions that actually moved the contract. There are five.

1. Will it answer correctly on our data. Not on a sample the vendor picked. On a representative slice of your real conversations, scored by your auditor, with the vendor blind to which conversations were chosen. Every vendor will offer you a curated demo on synthetic data. Decline. The conversation that decides your deployment is the one neither of you can rehearse for.

2. Will it act, not just answer. A platform that talks beautifully and cannot touch your systems is a search engine with a chat interface. Ask each finalist to fire one real action during the pilot. A CRM lookup. A ticket creation. An order status. If the integration story collapses in the pilot, it will collapse in production.

3. What happens when it should not handle the conversation. Force a handoff. Time it. Watch the human arrive. If the human lands on a blank screen, the vendor has not solved the part that matters. Every executive I have ever spoken to remembers a bad handoff. None of them remember a good answer.

4. Who owns this in eighteen months. The most expensive line in the contract is the one nobody discusses. Day-two ownership. Ask the vendor to describe the operating team you will need a year from launch. If their answer is, "Our professional services team handles that," you are not buying a platform, you are leasing a relationship.

5. What does it cost to leave. Not the contract exit clause. The real cost. Can you export your prompts, your conversation data, your retrieval index. Can you point the same connectors at a different model provider. If the answer to either is no, the platform is renting you back your own work.

Every other question on a typical RFP is a variation on one of these five. Reorganise your scoring around them and the noise drops.

What to cut from your scoring

Three criteria I watch executives over-weight every quarter.

Logo lists. Every vendor has logos. Logos tell you who bought, not who renewed, and renewal is the only metric that distinguishes a real customer from a pilot.

Vendor benchmark numbers. The only resolution rate that matters is the one your auditor measures on your data. A 70 percent containment rate quoted from a different customer is a fact about that customer, not a prediction about you.

Builder UI polish. The operator builds the agent. The customer never sees the builder. A beautiful builder is pleasant for your team and irrelevant to the customer on the other end of the conversation.

The pilot protocol

Here is the protocol I would run if I were on your side of the table. I have watched it kill bad decisions and make good ones boring.

Same data, every vendor. Pull two to five hundred real conversations from the last quarter. Redact what you have to. Hand the identical set to every finalist. If a vendor will not work with your data, the evaluation is over.

Success metric written down first. Decide before the pilot starts what passing looks like. Grounded resolution above X. Hallucination rate below Y. Handoff under Z seconds. Write the numbers on a single page. Sign it internally before any vendor sees the data. This single page has saved more evaluations than every RFP I have ever read.

Blind scoring. Your QA team or an external auditor scores the outputs without knowing which vendor produced them. The vendor does not score its own pilot. This is the step buyers skip most often and regret most reliably.

A breakage test. Ten deliberately hard inputs sent to every vendor. The customer who changes their mind mid-conversation. The policy question with a recent exception. The multi-language thread. The request the agent should refuse. How a platform fails matters more than how it succeeds, because in production it will do both.

One real integration. Pick one of your systems and ask each finalist to fire a live action against it during the pilot. Latency, error rate, what happens on retry. Half the vendors that pass the conversation test fail this one.

A handoff stopwatch. Trigger a handoff. Start a timer. The human agent picks up. Read what they see. If they see a blank thread, you have your answer.

Six steps. Six weeks. Most enterprises try to skip three of them and then spend the next year wondering why the deployment underperformed.
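Three of those steps are mechanical enough to sketch in code: the identical sample, the blind scoring, and the signed page of numbers. What follows is a minimal illustration in Python, not a reference implementation. The function names, the data shapes, and the threshold values standing in for X, Y and Z are all assumptions made for the example.

```python
import random

# Hypothetical thresholds: stand-ins for the X, Y and Z on your signed page.
THRESHOLDS = {
    "min_grounded_resolution": 0.60,
    "max_hallucination_rate": 0.02,
    "max_handoff_seconds": 60.0,
}


def sample_conversations(conversation_ids, k, seed):
    """Draw one fixed, reproducible sample so every finalist gets the same set."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(conversation_ids), k))


def blind(outputs_by_vendor, seed):
    """Shuffle all vendor outputs and give each an opaque item id.

    Returns (items, key). The auditor sees only `items`. The `key`
    (item id -> vendor) stays sealed until scoring is finished.
    """
    rows = [
        (vendor, conversation_id, text)
        for vendor, outputs in sorted(outputs_by_vendor.items())
        for conversation_id, text in sorted(outputs.items())
    ]
    random.Random(seed).shuffle(rows)
    items, key = [], {}
    for n, (vendor, conversation_id, text) in enumerate(rows, start=1):
        item_id = f"item-{n:04d}"
        key[item_id] = vendor
        items.append(
            {"item_id": item_id, "conversation_id": conversation_id, "output": text}
        )
    return items, key


def unblind_and_score(verdicts, key, handoff_seconds, thresholds=THRESHOLDS):
    """Aggregate the auditor's verdicts per vendor and apply the thresholds.

    `verdicts` maps item id -> {"resolved": bool, "hallucinated": bool}.
    `handoff_seconds` maps vendor -> measured handoff time in seconds.
    """
    totals = {}
    for item_id, verdict in verdicts.items():
        t = totals.setdefault(key[item_id], {"n": 0, "resolved": 0, "hallucinated": 0})
        t["n"] += 1
        t["resolved"] += verdict["resolved"]
        t["hallucinated"] += verdict["hallucinated"]

    report = {}
    for vendor, t in totals.items():
        resolution = t["resolved"] / t["n"]
        hallucination = t["hallucinated"] / t["n"]
        handoff = handoff_seconds[vendor]
        report[vendor] = {
            "grounded_resolution": resolution,
            "hallucination_rate": hallucination,
            "handoff_seconds": handoff,
            "passed": (
                resolution > thresholds["min_grounded_resolution"]
                and hallucination < thresholds["max_hallucination_rate"]
                and handoff < thresholds["max_handoff_seconds"]
            ),
        }
    return report
```

The design choice that matters is the order of operations. The thresholds are fixed before any output exists, and the key is opened only after the last verdict is in.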

The pricing conversation nobody wants to have

Per seat, per conversation, per resolution, per token. Every pricing model is defensible and every one has a failure mode. Per seat under-rewards the automation you bought the platform for. Per conversation punishes you for growing. Per resolution depends on a definition of resolution that almost never survives an audit. Per token couples your bill to the model provider's next price change.

The question is not which model is cheapest in year one. It is which model still makes sense in year three, at three times the volume, with a different model provider underneath. Ask every vendor to run the math at your year-three volume. Ask them in writing. Most of them will not, which is its own data point.
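If a vendor will not run the year-three math, a rough version is easy to run yourself. The Python sketch below compares the four pricing models at year-one volume and at three times that volume. Every number in it is an invented placeholder: the unit prices, the seat count, the conversation volume, the resolution rate and the tokens per conversation. Replace them with the figures from your own quotes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Volume:
    seats: int
    conversations: int            # conversations per year
    resolution_rate: float        # share of conversations billed as resolved
    tokens_per_conversation: int


# Hypothetical unit prices, for illustration only.
PRICES = {
    "per_seat": 1200.0,           # per seat, per year
    "per_conversation": 0.50,
    "per_resolution": 1.00,
    "per_1k_tokens": 0.02,
}


def annual_cost(model, volume, prices=PRICES):
    """Annual cost of one pricing model at a given volume."""
    if model == "per_seat":
        return volume.seats * prices["per_seat"]
    if model == "per_conversation":
        return volume.conversations * prices["per_conversation"]
    if model == "per_resolution":
        resolved = volume.conversations * volume.resolution_rate
        return resolved * prices["per_resolution"]
    if model == "per_token":
        tokens = volume.conversations * volume.tokens_per_conversation
        return tokens / 1000 * prices["per_1k_tokens"]
    raise ValueError(f"unknown pricing model: {model}")


def projection(year_one, growth=3.0, prices=PRICES):
    """Year-one and year-three cost per model, with conversations scaled by `growth`.

    Seats are held flat on the assumption that automation absorbs the growth.
    """
    year_three = replace(year_one, conversations=int(year_one.conversations * growth))
    models = ("per_seat", "per_conversation", "per_resolution", "per_token")
    return {
        m: (annual_cost(m, year_one, prices), annual_cost(m, year_three, prices))
        for m in models
    }


if __name__ == "__main__":
    today = Volume(seats=50, conversations=200_000,
                   resolution_rate=0.5, tokens_per_conversation=4_000)
    for model, (y1, y3) in projection(today).items():
        print(f"{model:<18} year 1: {y1:>12,.0f}   year 3: {y3:>12,.0f}")
```

To test the per-token failure mode, pass a second `prices` dictionary with a changed `per_1k_tokens` value and compare the year-three column.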

What the board wants to see

When this goes to the board, the slide that closes the conversation has three columns. Vendor. Scored outcome on your data. Three-year all-in cost. Everything else is appendix. Executives who walk in with this slide finish the discussion in one meeting. Executives who walk in with a feature comparison finish it in three, and the third meeting is the one where someone says, "Let's revisit the requirements."

When to skip all of this

Two cases. First, narrow and reversible. One channel, one use case, a contract you can exit in ninety days. Pilot one platform, learn, reevaluate. The cost of the evaluation exceeds the cost of the wrong choice. Second, an incumbent already covers four of the five questions on adjacent work and the marginal cost of adding the new use case is small. Switching vendors for a ten percent improvement is rarely the trade the board wants you to make.

The one test I would run this week

If you only do one thing before your next vendor meeting, do this. Write down, on a single page, the five answers your board needs by the time you sign. Carry that page into every demo. Hand it to every vendor at the end of the call and ask them to respond in writing in three business days. The vendors who can will move to the pilot. The vendors who cannot have answered the question for you.

If you want a second pair of eyes on that page, book a working session. I will not show you a demo. I will read your page, tell you where it is sharp and where it is hand-waving, and run our platform against your hardest tickets in front of you. The strongest signal we can give you is that we welcome the audit. The second-strongest is what we say when we lose one.

How to Evaluate a Conversational AI Platform Without Getting Sold a Demo