How We Built a Customer Support AI Agent That Resolves 73% of Tickets
A detailed case study of how we built a customer support AI agent for a B2B SaaS company — architecture decisions, RAG pipeline design, and the path to 73% auto-resolution.
In Q4 2024, a Series A B2B SaaS company came to us with a problem: their 4-person support team was drowning in 200+ tickets per day. Response times had ballooned to 6 hours. Customer satisfaction was dropping.
They didn't need more support reps. They needed an AI agent that could handle the repetitive tier-1 tickets — instantly, accurately, and with the right brand voice.
Here's exactly how we built it, the architectural decisions we made, and what we'd do differently.
The Brief
Client: A B2B SaaS platform for mid-market companies (identity withheld under NDA)
Industry: Operations Management
Support volume: 200–250 tickets/day across email, in-app chat, and Slack
Team size: 4 support agents
Goal: Automate 50%+ of tier-1 tickets while maintaining 90%+ CSAT
Budget: $8,000 for MVP, $2,000/month ongoing
Timeline: 4 weeks to production
The Architecture Decision
Why RAG, Not Fine-Tuning
We considered both approaches (read our full RAG vs fine-tuning comparison) and chose RAG for three reasons:
- Their help center changes weekly: New features, updated workflows, policy changes. Fine-tuning would require retraining every time content changed. RAG pulls the latest version automatically.
- Source attribution was required: The client needed the AI to cite specific help articles so customers could read more. RAG provides this natively.
- Speed to production: A RAG MVP was achievable in 3 weeks. Fine-tuning would have taken 8–10 weeks, and they were losing customers now.
System Architecture
Customer Message (email/chat/Slack)
↓
Intent Classifier (fine-tuned DistilBERT)
↓
┌─────────────────────────────────┐
│ Tier-1 (automatable) │ → RAG Pipeline → AI Response → Customer
│ │
│ Tier-2 (needs human) │ → Route to human agent with AI-generated context
│ │
│ Urgent/Sensitive │ → Immediate escalation + alert
└─────────────────────────────────┘
↓
Feedback Loop → Quality Monitoring → Optimization

We used a hybrid approach: a small fine-tuned classifier for intent routing, and RAG for answer generation. This gave us the speed of classification with the accuracy and flexibility of retrieval.
Week-by-Week Build
Week 1: Data Pipeline & RAG Foundation
The first challenge was data.
The client had:
- 340 help center articles (in Zendesk)
- 12,000 historical support tickets with agent responses
- Internal runbooks and SOPs (Google Docs)
- Slack threads with edge case solutions
We built a data ingestion pipeline that:
- Extracted content from Zendesk, Google Docs, and Slack via APIs
- Cleaned and normalized text (removed HTML, fixed formatting, resolved abbreviations)
- Chunked documents using semantic chunking (not fixed-size) for better retrieval
- Embedded chunks using OpenAI's text-embedding-3-large
- Stored vectors in Qdrant (chosen for hybrid search support)
Key decision: We used semantic chunking instead of fixed 512-token chunks. This improved retrieval relevance by 23% in our initial tests because each chunk represented a complete concept, not an arbitrary text split.
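To illustrate the idea, here's a minimal semantic chunker that starts a new chunk whenever adjacent sentences stop sharing vocabulary. Our production pipeline compared text-embedding-3-large vectors; the Jaccard word-overlap similarity, the naive sentence splitter, and the 0.2 threshold below are stand-in assumptions to keep the sketch self-contained:

```python
import re

def sentence_split(text: str) -> list[str]:
    # Naive splitter on end-of-sentence punctuation; production used
    # a proper sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def similarity(a: str, b: str) -> float:
    # Jaccard word overlap as a stand-in; production compared
    # OpenAI embedding vectors instead.
    wa = set(re.findall(r"[a-z0-9]+", a.lower()))
    wb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    chunks, current = [], []
    for sent in sentence_split(text):
        if current and similarity(current[-1], sent) < threshold:
            # Low similarity to the previous sentence -> concept boundary.
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The payoff is that each chunk stays on one topic, so a retrieved chunk answers a whole question instead of half of one.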
Week 2: Intent Classification & RAG Pipeline
Intent classifier: We fine-tuned DistilBERT on the 12,000 historical tickets, labeled into 5 categories:
| Intent | % of Volume | Action |
|---|---|---|
| How-to questions | 42% | Auto-resolve via RAG |
| Bug reports | 18% | Classify → route to engineering |
| Account/billing | 15% | Auto-resolve via RAG + API |
| Feature requests | 12% | Log → acknowledge → route to product |
| Escalation/urgent | 13% | Immediate human routing |
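The routing side of the classifier reduces to a lookup gated by the model's own probability. The label names and the 0.8 probability floor below are illustrative assumptions, not the production values:

```python
# Hypothetical intent labels mirroring the five categories above.
ACTIONS = {
    "how_to": "auto_resolve_rag",
    "bug_report": "route_engineering",
    "account_billing": "auto_resolve_rag_api",
    "feature_request": "log_and_route_product",
    "urgent": "human_escalation",
}

def route(intent: str, prob: float, min_prob: float = 0.8) -> str:
    # Below the classifier's confidence floor, fall back to a human
    # rather than trust a shaky routing decision.
    if prob < min_prob or intent not in ACTIONS:
        return "human_escalation"
    return ACTIONS[intent]
```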
RAG pipeline: For auto-resolvable tickets:
- Query embedding + hybrid search (semantic + keyword) in Qdrant
- Top 5 chunks retrieved and re-ranked using a cross-encoder
- Retrieved context + customer message + system prompt → Claude 3.5 Sonnet
- Response generated with citations (linked to specific help articles)
- Confidence score calculated (if <0.7, route to human instead)
Key decision: The confidence threshold was critical. We'd rather route a ticket to a human than give a wrong answer. Starting at 0.7 and tuning down over time was the right approach.
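The retrieve, re-rank, and confidence-gate steps can be sketched as a single function. The mean-of-top-5 confidence here is a simplification (the production score also folded in an LLM self-assessment), and the Claude call is stubbed as a comment:

```python
from dataclasses import dataclass

@dataclass
class Retrieved:
    text: str
    score: float       # cross-encoder relevance, 0..1
    article_url: str   # for citations in the final answer

def answer_or_escalate(question: str, retrieved: list[Retrieved],
                       threshold: float = 0.7) -> dict:
    top = sorted(retrieved, key=lambda r: r.score, reverse=True)[:5]
    confidence = sum(r.score for r in top) / len(top) if top else 0.0
    if confidence < threshold:
        # Better to hand off than to guess.
        return {"action": "escalate", "confidence": confidence}
    context = "\n\n".join(r.text for r in top)
    citations = [r.article_url for r in top]
    # In production: context + question + system prompt -> Claude 3.5 Sonnet.
    return {"action": "respond", "confidence": confidence,
            "context": context, "citations": citations}
```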
Week 3: Multi-Channel Integration & Testing
Channel integration:
- Email: Connected to Zendesk API — AI responses appear as agent replies
- In-app chat: Embedded widget with real-time AI responses
- Slack: Bot that monitors the support channel and responds in-thread
Testing protocol:
- Fed 500 historical tickets through the system
- Had human agents grade each AI response (correct, partially correct, wrong)
- Measured: accuracy (correct answers), coverage (% of tickets handled), latency (response time)
Week 3 results: 61% auto-resolution, 94% accuracy on resolved tickets, 1.8s average response time.
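Coverage and accuracy fall out of the graded sample directly. A sketch, assuming partially correct answers count against accuracy (which may differ from the client's rubric):

```python
def evaluate(grades):
    # grades[i] is "correct" / "partial" / "wrong" for tickets the AI
    # answered, or None when it routed the ticket to a human.
    answered = [g for g in grades if g is not None]
    coverage = len(answered) / len(grades) if grades else 0.0
    accuracy = answered.count("correct") / len(answered) if answered else 0.0
    return {"coverage": coverage, "accuracy": accuracy}
```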
Clearing our 50%+ target with 61% was encouraging. But we wanted more.
Week 4: Optimization & Production Deployment
Optimizations that moved us from 61% to 73%:
- Added FAQ-style Q&A pairs: We extracted the top 50 questions from historical tickets and created exact-match responses. This alone added 7% coverage.
- Improved chunking for product docs: The product documentation had nested steps that were being split across chunks. We implemented parent-child chunking — small chunks for retrieval, parent chunks for context.
- Conversation history: For chat channels, we included the last 3 messages as context. This handled follow-up questions like "what about the enterprise plan?" without re-explaining.
- Reduced false routing: Tuned the intent classifier to better distinguish between "how do I do X" (auto-resolvable) and "X isn't working" (might need human).
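The parent-child chunking change can be sketched like this: small child chunks do the matching, but the full parent document is what reaches the LLM. The word-overlap scorer is a toy stand-in for the embedding search Qdrant actually performs:

```python
def build_parent_child(docs: dict[str, str], child_size: int = 200):
    """Split each parent doc into small child chunks, keeping a pointer back."""
    children = []  # list of (child_text, parent_id)
    for parent_id, text in docs.items():
        words = text.split()
        for i in range(0, len(words), child_size):
            children.append((" ".join(words[i:i + child_size]), parent_id))
    return children

def retrieve_parent(query: str, children, docs: dict[str, str]) -> str:
    # Toy scorer: shared-word count. Production matched child embeddings
    # in Qdrant instead.
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))
    best_child, parent_id = max(children, key=lambda c: overlap(query, c[0]))
    # The small chunk wins the match; the full parent supplies LLM context.
    return docs[parent_id]
```

This keeps retrieval precise (small chunks) without losing the surrounding steps the answer needs (parent context).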
Production deployment:
- Deployed on AWS using our Cloud CI/CD pipeline
- Set up monitoring dashboards for response quality, latency, and cost
- Created escalation alerts for low-confidence responses
- Documented runbooks for the client's team
Results After 90 Days
| Metric | Before | After | Change |
|---|---|---|---|
| Auto-resolution rate | 0% | 73% | — |
| Average response time | 6 hours | 12 seconds (AI) / 45 min (human) | 99.9% faster for AI |
| CSAT score | 3.4/5 | 4.2/5 | +23% |
| Support team capacity | 200 tickets/day (strained) | 200 tickets/day (comfortable) | Team handles 2x complexity |
| Cost per ticket | $4.20 | $1.10 | -74% |
| Monthly support cost | $24,000 | $8,500 | -65% |
The support team didn't shrink. They shifted from answering "how do I export a CSV?" 40 times a day to handling complex escalations, building better documentation, and doing proactive customer outreach.
Technical Lessons Learned
1. Chunking Strategy Matters More Than Model Choice
We spent 2 days testing GPT-4o vs Claude 3.5 Sonnet vs Llama 3. The accuracy difference was minimal (±2%). But switching from fixed-size to semantic chunking improved accuracy by 12%.
Takeaway: Invest in your data pipeline, not model shopping.
2. Confidence Scoring Saves Trust
The confidence threshold prevented ~15% of tickets from getting wrong answers. Every wrong answer erodes customer trust faster than a slow human response builds it.
We calibrated the threshold by tracking false positive rates weekly and adjusting. This is part of our ongoing monitoring service.
3. Human Escalation Must Be Seamless
When the AI routes to a human, it includes:
- Full conversation history
- Retrieved context chunks (so the human doesn't re-search)
- AI's attempted answer (marked as draft, not sent to customer)
- Confidence score and reason for escalation
This cut human resolution time by 40% even for escalated tickets.
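A minimal shape for that escalation payload, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class EscalationPacket:
    """Everything a human agent needs so they never start from zero."""
    conversation: list[str]      # full message history
    retrieved_chunks: list[str]  # context the AI already found
    draft_answer: str            # AI attempt, never sent to the customer
    confidence: float
    escalation_reason: str

    def summary(self) -> str:
        # One-line header shown at the top of the agent's queue item.
        return (f"confidence={self.confidence:.2f} "
                f"reason={self.escalation_reason} "
                f"chunks={len(self.retrieved_chunks)}")
```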
4. The Knowledge Base Is the Product
The AI is only as good as the knowledge base. We discovered that 30% of help articles were outdated or incomplete. Fixing the content improved AI accuracy more than any technical optimization.
Recommendation: Before building a support AI agent, audit your help center. Remove outdated content, fill gaps, and standardize formatting.
Architecture Deep Dive
For technical readers, here's the full stack:
AI Layer
- Intent classification: Fine-tuned DistilBERT (HuggingFace, hosted on AWS SageMaker)
- Embedding: OpenAI text-embedding-3-large
- Vector DB: Qdrant (self-hosted, single node)
- Re-ranker: Cross-encoder (ms-marco-MiniLM-L-6-v2)
- LLM: Claude 3.5 Sonnet (for response generation)
- Confidence: Custom scoring based on retrieval relevance + LLM self-assessment
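The confidence blend can be sketched as a weighted average of mean reranker relevance and the LLM's own 0–1 self-rating; the 0.6/0.4 weights below are illustrative, not the tuned production values:

```python
def confidence(retrieval_scores: list[float], self_assessment: float,
               w_retrieval: float = 0.6) -> float:
    # Blend retrieval relevance with the LLM's self-rating; an empty
    # retrieval set means we have nothing to stand on.
    if not retrieval_scores:
        return 0.0
    mean_rel = sum(retrieval_scores) / len(retrieval_scores)
    return w_retrieval * mean_rel + (1 - w_retrieval) * self_assessment
```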
Application Layer
- Backend: Python FastAPI
- Queue: Redis for async ticket processing
- Database: PostgreSQL for ticket history, analytics, feedback
- API integrations: Zendesk, Slack, custom chat widget
Infrastructure
- Hosting: AWS ECS (Fargate) for the application, SageMaker for the classifier
- CI/CD: GitHub Actions → ECR → ECS
- Monitoring: Datadog for infrastructure, custom dashboards for AI metrics
- Cost: ~$450/month total infrastructure
Monitoring Stack
- Response quality score (LLM-as-judge, sampled daily)
- Retrieval relevance metrics (MRR, NDCG)
- Latency percentiles (p50, p95, p99)
- Cost per ticket trending
- CSAT correlation with AI vs human responses
- Drift detection (monthly model evaluation)
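For reference, MRR (mean reciprocal rank) rewards retrieval runs where the first relevant chunk appears near the top of the ranking. A minimal implementation:

```python
def mean_reciprocal_rank(results: list[list[bool]]) -> float:
    """results[i] holds ranked relevance flags for query i (True = relevant)."""
    total = 0.0
    for flags in results:
        for rank, hit in enumerate(flags, start=1):
            if hit:
                # Only the first relevant result counts for each query.
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0
```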
Build vs. Buy: Why Custom Won
The client evaluated several off-the-shelf support AI tools (Intercom Fin, Zendesk AI, Ada). Here's why they chose a custom build:
| Factor | Off-the-Shelf | Custom (AIqwip) |
|---|---|---|
| Resolution rate | 30–45% | 73% |
| Customization | Limited | Full control |
| Multi-channel | Usually single | Email + chat + Slack |
| Integration depth | Surface level | Deep (account data, billing API) |
| Cost at scale | $1–$3 per resolution | $0.15 per resolution |
| Brand voice | Generic | Fully customized |
The custom build cost more upfront but paid for itself in 6 weeks through reduced support costs.
Want Similar Results?
We offer a pre-built Customer Support AI Agent that includes:
- Tier-1 auto-resolution with RAG
- Multi-channel support (email, chat, Slack, WhatsApp)
- Human escalation with context
- CSAT tracking and sentiment analysis
- Knowledge base integration
Or, if you need a fully custom solution:
- Book a discovery call: We'll assess your support volume, channels, and knowledge base.
- MVP in 3–4 weeks: Working support AI agent, deployed and handling tickets.
- Optimize over 90 days: Target 60%+ auto-resolution with ongoing monitoring and optimization.
