How to tell if your AI is actually working: 6 metrics that matter

A chatbot that frustrates customers until they give up has a 100% deflection rate. That's the number most AI vendors lead with: "% handled by AI," "% deflected." It goes up when the bot is failing. So deflection is the wrong gauge for whether your AI is working. The right question is narrower and harder to fake. Is it helping customers, or just hiding them from you?

An owner really wants to know two things. Is the AI helping people or burying them, and is it making money or just making the dashboard look busy? Six numbers answer that honestly. Each one comes paired with the vanity metric it replaces and a guard number you have to read beside it.

Here's the trap in one line. Deflection counts conversations that stopped. It says nothing about whether the problem got solved. A bot can run its containment rate to 100% by being so useless that everyone rage-quits to email. The slide stays green. The customers are gone.

The reframe

So we'll walk through six numbers that don't do that. Start with the one you can flip and watch. The card below is a working model of your own funnel: toggle it between a typical inbox and an AI that's pulling its weight, and watch each gauge recolor against its own healthy range. The arrows are illustrative on purpose. The real, sourced benchmark lives on each gauge, one click away.

The six-gauge scorecard

Illustrative figures to show direction — not benchmarks, not a Cura claim. The real, sourced number sits on each gauge’s “what good looks like” line.

First-response time

under 5 min · healthy

Under 5 min on messaging apps; live-chat CSAT peaks near 85% at a 5-10s first reply.

paired with: resolution rate

Resolution rate

61% · healthy

Mature RAG containment ~55-65%; but only 14% of self-service issues are fully resolved (Gartner).

paired with: re-contact rate

Booking conversion

54% · healthy

Home-services call-to-booking averages ~42%; 50-70% is excellent (ServiceTitan).

paired with: no-show rate

No-show rate

11% · watch

US single-specialty ~6.8% (MGMA, 2023); global outpatient average ~23.5%.

paired with: reschedule rate

Reactivation rate

18% · healthy

Strong win-back reactivates ~10-30% of lapsed customers.

paired with: follow-up cap

Escalation rate

22% · healthy

Healthy band ~15-30%. A low number with low CSAT means people are trapped, not helped.

paired with: CSAT

Two gauges break the “higher is better” habit on purpose. No-show is lower-is-better. Escalation is a band: too low, with low satisfaction, means customers are stuck in a loop, not helped.

Flip it a few times. Notice that two gauges don't get greener as the number climbs. No-show is lower-is-better, and escalation is a band, not a target. Hold that thought. It's the whole point of the back half of this piece.

The six numbers, one at a time

What to measure, and what good looks like

1 · First-response time

good: under 5 min on chat

Minutes from a customer's message to your first real reply. This is the cheapest number to move and the most punishing to ignore. The classic MIT/InsideSales study (Dr. James Oldroyd, 2007, often misattributed to HBR, so cite it right) found that responding to a lead within five minutes instead of thirty made you roughly 100x more likely to reach the lead at all and about 21x more likely to qualify it. Live-chat satisfaction peaks around 84.7% when the first reply lands inside 5–10 seconds. The window that kills most service businesses isn't business hours. It's the after-hours and weekend stretch when the clock, not the team, is the problem. That's also where a missed message quietly becomes a lost regular.
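If you want to pull this number yourself, a minimal sketch looks like the following. It assumes each conversation is a dict with two timestamps; the field names and the sample data are hypothetical, not a Cura export format.

```python
from datetime import datetime
from statistics import median

def first_response_minutes(conversations):
    """Minutes from first customer message to first reply, one value per answered conversation."""
    waits = []
    for convo in conversations:
        if convo["first_reply_at"] is None:
            continue  # never answered: count these separately, don't hide them
        delta = convo["first_reply_at"] - convo["first_message_at"]
        waits.append(delta.total_seconds() / 60)
    return waits

# Hypothetical sample: one daytime chat, one after-hours message, one weekend chat, one ignored.
sample = [
    {"first_message_at": datetime(2025, 3, 3, 10, 0), "first_reply_at": datetime(2025, 3, 3, 10, 2)},
    {"first_message_at": datetime(2025, 3, 3, 21, 30), "first_reply_at": datetime(2025, 3, 4, 8, 30)},
    {"first_message_at": datetime(2025, 3, 8, 14, 0), "first_reply_at": datetime(2025, 3, 8, 14, 4)},
    {"first_message_at": datetime(2025, 3, 9, 9, 0), "first_reply_at": None},
]

waits = first_response_minutes(sample)
median_wait = median(waits)                             # 4.0 minutes
share_under_5 = sum(w < 5 for w in waits) / len(waits)  # 2 of 3 answered
unanswered = sum(c["first_reply_at"] is None for c in sample)
```

The median here looks healthy at four minutes while one customer waited eleven hours overnight and another never got a reply. That's why the share under target and the unanswered count belong next to the median.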

2 · Resolution rate

good: verified, not just contained

The share of conversations the AI actually finishes, and the number most worth being suspicious of. Containment benchmarks look great: mature retrieval-augmented bots land around 55–65%, and best-in-class hit 70–80% (Gartner's 2025 Customer Service Technology Survey). Then the floor drops out. Gartner surveyed 5,728 customers in 2024 and found only 14% of self-service issues were fully resolved. Even for problems people called "very simple," just 36% got solved without a human. Containment counts the conversations that stopped. Resolution counts the ones that ended solved. Track the gap with re-contact rate: if more than a quarter of "resolved" customers message again within 48 hours, your deflection number is fiction.
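The re-contact check is simple enough to sketch. This version assumes you can export closed conversations and later inbound messages as `(customer_id, timestamp)` pairs; those shapes and the sample data are hypothetical. It also matches on customer only, not on topic, so it will slightly overcount people who came back about something new.

```python
from datetime import datetime, timedelta

RECONTACT_WINDOW = timedelta(hours=48)
RECONTACT_LIMIT = 0.25  # "more than a quarter" from the text above

def recontact_rate(resolved, later_messages, window=RECONTACT_WINDOW):
    """Share of 'resolved' conversations whose customer wrote again within the window."""
    if not resolved:
        return 0.0
    recontacted = 0
    for customer_id, closed_at in resolved:
        came_back = any(
            cid == customer_id and closed_at < sent_at <= closed_at + window
            for cid, sent_at in later_messages
        )
        if came_back:
            recontacted += 1
    return recontacted / len(resolved)

# Hypothetical sample: four conversations the bot marked resolved at noon on March 3.
closed = datetime(2025, 3, 3, 12, 0)
resolved = [("a", closed), ("b", closed), ("c", closed), ("d", closed)]
later_messages = [
    ("a", closed + timedelta(hours=5)),   # back the same day
    ("b", closed + timedelta(hours=47)),  # back just inside the window
    ("c", closed + timedelta(hours=72)),  # back, but outside the window
]

rate = recontact_rate(resolved, later_messages)
overstated = rate > RECONTACT_LIMIT
```

Two of the four "resolved" customers came back inside 48 hours, so the rate is 0.5 and the reported resolution number should be treated as overstated.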

3 · Booking conversion

good: 50–70% of inbound (services)

The percentage of inbound conversations that turn into a confirmed booking. This is the money metric, and service-business numbers are specific. Call-to-booking for home services averages about 42% (ServiceTitan's data via NextPhone); 50–70% is excellent, 70%+ is top-tier, and a Chicago plumbing firm, Black Diamond, runs 77%. Now the leak that makes it urgent. Across 130,175 calls at 45 home-services businesses, 74.1% went unanswered. Of the ones answered, only about 35% of agents ever asked for the booking. Answer everything, then actually ask. That's the check_availability → create_booking loop doing two boring jobs a tired front desk skips at 6pm.

4 · No-show rate

good: low, and lower beats louder

Booked, then didn't arrive. US single-specialty practices averaged 6.81% in 2023 (MGMA); the global outpatient average is far rougher, around 23.5%. The losses are widely estimated in the tens of billions a year in US healthcare alone, but treat that as an estimate, not a hard figure. Piling on more reminders barely moves it. What moves it is a confirmation that needs a reply and a reschedule that takes one tap, which is why reminders alone aren't enough and why a no-show is really a follow-up problem.

5 · Reactivation rate

good: ~10–30% of lapsed customers

The share of lapsed customers you bring back. Almost nobody measures it, which is exactly why it's worth measuring. Strong win-back campaigns reactivate 10–30% of dormant customers, and the prize is lopsided: repeat customers are roughly 21% of buyers but drive about 44% of revenue. The risk is becoming the business that spams its own list. Cura caps follow-ups at two touches per seven days, so a win-back nudge stays a nudge.
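A cap like that is a small rolling-window check. The sketch below is one way to express "two touches per seven days"; it assumes a rolling window rather than a calendar week, and it is an illustration, not Cura's actual implementation. The function and constant names are made up for the example.

```python
from datetime import datetime, timedelta

MAX_TOUCHES = 2
WINDOW = timedelta(days=7)

def can_send_follow_up(previous_touches, now, max_touches=MAX_TOUCHES, window=WINDOW):
    """True if one more follow-up at `now` stays within the rolling cap."""
    # Only touches inside the last `window` count against the cap.
    recent = [t for t in previous_touches if now - window < t <= now]
    return len(recent) < max_touches

# Hypothetical customer who was nudged on March 1 and March 3.
touches = [datetime(2025, 3, 1, 9, 0), datetime(2025, 3, 3, 9, 0)]

blocked = can_send_follow_up(touches, datetime(2025, 3, 5, 9, 0))  # two touches in window
allowed = can_send_follow_up(touches, datetime(2025, 3, 9, 9, 0))  # March 1 has aged out
```

On March 5 both earlier touches are still inside the window, so a third is refused. By March 9 the first one has aged out and one more nudge is allowed.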

6 · Escalation rate

good: a band (~15–30%), not zero

The share the AI hands to a human. Read this one backwards. A low handoff rate is a red flag when satisfaction is also low: it means people are stuck in a loop, not getting helped. Healthy sits around 15–30% depending on how messy your questions are. Gartner found that 53% of customers went straight to a human, and only 20% of those did it because self-service looked unavailable. People often just want a person. A clean handoff is a pass, not a failure, which means you grade escalation next to CSAT or not at all.
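Reading escalation as a band next to CSAT can be written down as a small rule. The sketch below uses the rough 15–30% band from above; the CSAT floor of 0.80 is a placeholder assumption, not a sourced benchmark, so set it from your own baseline. The labels are illustrative too.

```python
def read_escalation(escalation_rate, csat, band=(0.15, 0.30), csat_floor=0.80):
    """Classify an escalation rate only in the context of CSAT."""
    low, high = band
    satisfied = csat >= csat_floor  # csat_floor is a hypothetical threshold
    if escalation_rate < low:
        # Few handoffs is only good news when customers are also happy.
        if satisfied:
            return "ok: low handoffs, satisfied customers"
        return "red flag: likely trapped in a loop"
    if escalation_rate > high:
        return "review: above the band"
    if satisfied:
        return "healthy"
    return "review: in band but CSAT is low"

trapped = read_escalation(0.05, 0.60)
healthy = read_escalation(0.22, 0.90)
```

The same 5% escalation rate reads as fine or as a red flag depending entirely on the satisfaction number beside it.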

The rule that ties all six together

Never read a number alone. That's the whole discipline. Every healthy-looking metric has a twin that tells you whether it's lying.

Speed without resolution is just fast wrong answers. Deflection without CSAT is customers hiding from you in a nicer font. A low escalation rate without satisfaction is a trap with the door painted shut. A booking count without a no-show rate is a calendar full of ghosts. The vanity number and the guard number always travel together, and a dashboard that shows you one but not the other isn't neutral. It's flattering you.
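One way to make the pairing rule hard to skip is to build it into the report itself. The sketch below uses the pairings from the scorecard; the function, the key names, and the sample values are hypothetical, and the CSAT figure in particular is made up for the example.

```python
# Each headline metric and the guard number the scorecard pairs it with.
GUARDS = {
    "first_response_time": "resolution_rate",
    "resolution_rate": "recontact_rate",
    "booking_conversion": "no_show_rate",
    "no_show_rate": "reschedule_rate",
    "reactivation_rate": "follow_up_cap",
    "escalation_rate": "csat",
}

def paired_report(metrics, headline):
    """Return (metric, value, guard, guard_value) rows; refuse any metric without its guard."""
    rows = []
    for name in headline:
        guard = GUARDS[name]
        if guard not in metrics:
            raise ValueError(f"{name} has no {guard} beside it; refusing to report it alone")
        rows.append((name, metrics[name], guard, metrics[guard]))
    return rows

# Illustrative values only, in the spirit of the scorecard above.
metrics = {"booking_conversion": 0.54, "no_show_rate": 0.11, "escalation_rate": 0.22, "csat": 0.90}
rows = paired_report(metrics, ["booking_conversion", "escalation_rate"])
```

Asking for `booking_conversion` without a `no_show_rate` in the data raises an error instead of printing a lonely green number.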

A number without its guard number is a guess in a nicer font.

This is where the plumbing matters more than the pitch. These six are checkable inside Cura not because the dashboard is louder, but because every draft passes a presend judge before it sends, and every tool call is logged. You can read why a conversation resolved, escalated, or booked, instead of trusting a single green percentage on a slide. Resolution isn't a vibe; it's a row you can open. Escalation isn't a mystery; it's a routed case with a reason attached. If you'd rather not hand the keys over on day one, start in Draft mode and watch the numbers before you let it send on its own.

Why this test matters more every quarter

Gartner expects agentic AI to autonomously resolve 80% of common customer-service issues by 2029, at roughly 30% lower cost. Gartner also expects more than 40% of agentic-AI projects to be scrapped by the end of 2027. Both can be true. The technology is real and arriving fast, and most deployments will still fail, usually because nobody measured whether the bot was helping or just deflecting. The Freshworks 2025 benchmark (187M+ tickets across 10,551 organizations) shows AI cutting first-response time from over six hours to under four minutes. Same report: AI deflects 45%+ of queries. One of those numbers is worth bragging about. You now know which.

So audit your AI the way you'd audit a new hire after a month. Not "how many tickets did it touch" but "did the customers it touched get what they came for." These six numbers are that review. Run them on any vendor, this one included.

Common questions

What is a good AI customer service resolution rate?

For containment — the share of conversations the AI finishes without a human — mature retrieval-augmented deployments average around 55–65%, best-in-class reach 70–80%, and rule-based bots fall below 35% (Gartner 2025 Customer Service Technology Survey). But containment isn't resolution. Gartner's August 2024 survey of 5,728 customers found only 14% of self-service issues are fully resolved. A "good" resolution rate is one you've verified: track whether the customer came back within about 48 hours. A re-contact rate above 25% means your reported resolution number is overstated.

What's the difference between containment and resolution rate?

Containment counts conversations that stopped reaching a human. Resolution confirms the customer's problem was actually solved. They diverge badly: a bot that frustrates people until they give up has a 100% containment (deflection) rate while solving nothing. The honest check is re-contact rate — if someone messages again about the same thing within 48 hours, it was contained but not resolved. Containment is the number on the vendor's slide. Resolution is the number on your bank statement.

Is a low escalation rate good?

Not by itself. A low handoff rate is a red flag when satisfaction is also low: it usually means customers are trapped in automation loops instead of getting help. A healthy escalation rate is roughly 15–30% depending on how complex your inquiries are. Gartner found that 53% of customers went straight to a human, and only 20% of those did so because they thought self-service was unavailable, so people often just want a person. A clean handoff to a human is a pass, not a failure. Always read escalation rate next to CSAT.

What customer service metrics actually matter for an AI agent?

Six: first-response time, verified resolution rate, booking conversion, no-show rate, reactivation rate, and escalation rate. Skip raw deflection or "% handled by AI" as a headline — it rises when the bot fails. The rule underneath all six is to never read a number alone. Speed without resolution is fast wrong answers. Deflection without CSAT is hidden customers. A full calendar without a no-show rate is a calendar full of ghosts. Each good number needs a guard number beside it.

Want these six numbers pulled on your real inbox, not an illustrative card? Book a demo and we'll read them together, and show you the conversation logs underneath each one.