How We Built a Customer Support AI Agent That Resolves 73% of Tickets
A detailed case study of how we built a customer support AI agent for a B2B SaaS company — architecture decisions, RAG pipeline design, and the path to 73% auto-resolution.
In Q4 2024, a Series A B2B SaaS company came to us with a problem: their 4-person support team was drowning in 200+ tickets per day. Response times had ballooned to 6 hours. Customer satisfaction was dropping.
They didn't need more support reps. They needed an AI agent that could handle the repetitive tier-1 tickets — instantly, accurately, and with the right brand voice.
Here's exactly how we built it, the architectural decisions we made, and what we'd do differently.
The Brief
Client: A B2B SaaS platform for mid-market companies (identity withheld under NDA)
Industry: Operations Management
Support volume: 200–250 tickets/day across email, in-app chat, and Slack
Team size: 4 support agents
Goal: Automate 50%+ of tier-1 tickets while maintaining 90%+ CSAT
Budget: $8,000 for MVP, $2,000/month ongoing
Timeline: 4 weeks to production
The Architecture Decision
Why RAG, Not Fine-Tuning
We considered both approaches (read our full RAG vs fine-tuning comparison) and chose RAG for three reasons:
- Their help center changes weekly: New features, updated workflows, policy changes. Fine-tuning would require retraining every time content changed. RAG pulls the latest version automatically.
- Source attribution was required: The client needed the AI to cite specific help articles so customers could read more. RAG provides this natively.
- Speed to production: A RAG MVP was achievable in 3 weeks. Fine-tuning would have taken 8–10 weeks, and they were losing customers now.
System Architecture
Customer Message (email/chat/Slack)
↓
Intent Classifier (fine-tuned DistilBERT)
↓
┌─────────────────────────────────┐
│ Tier-1 (automatable) │ → RAG Pipeline → AI Response → Customer
│ │
│ Tier-2 (needs human) │ → Route to human agent with AI-generated context
│ │
│ Urgent/Sensitive │ → Immediate escalation + alert
└─────────────────────────────────┘
↓
Feedback Loop → Quality Monitoring → Optimization

We used a hybrid approach: a small fine-tuned classifier for intent routing, and RAG for answer generation. This gave us the speed of classification with the accuracy and flexibility of retrieval.
Week-by-Week Build
Week 1: Data Pipeline & RAG Foundation
The first challenge was data.
The client had:
- 340 help center articles (in Zendesk)
- 12,000 historical support tickets with agent responses
- Internal runbooks and SOPs (Google Docs)
- Slack threads with edge case solutions
We built a data ingestion pipeline that:
- Extracted content from Zendesk, Google Docs, and Slack via APIs
- Cleaned and normalized text (removed HTML, fixed formatting, resolved abbreviations)
- Chunked documents using semantic chunking (not fixed-size) for better retrieval
- Embedded chunks using OpenAI's text-embedding-3-large
- Stored vectors in Qdrant (chosen for hybrid search support)
Key decision: We used semantic chunking instead of fixed 512-token chunks. This improved retrieval relevance by 23% in our initial tests because each chunk represented a complete concept, not an arbitrary text split.
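To illustrate the idea, here's a minimal semantic chunker that starts a new chunk whenever adjacent sentences stop sharing vocabulary. Our production pipeline compared text-embedding-3-large vectors; the Jaccard word-overlap similarity, the naive sentence splitter, and the 0.2 threshold below are stand-in assumptions to keep the sketch self-contained:

```python
import re

def sentence_split(text: str) -> list[str]:
    # Naive splitter on end-of-sentence punctuation; production used
    # a proper sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def similarity(a: str, b: str) -> float:
    # Jaccard word overlap as a stand-in; production compared
    # OpenAI embedding vectors instead.
    wa = set(re.findall(r"[a-z0-9]+", a.lower()))
    wb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    chunks, current = [], []
    for sent in sentence_split(text):
        if current and similarity(current[-1], sent) < threshold:
            # Low similarity to the previous sentence -> concept boundary.
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The payoff is that each chunk stays on one topic, so a retrieved chunk answers a whole question instead of half of one.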
Week 2: Intent Classification & RAG Pipeline
Intent classifier: We fine-tuned DistilBERT on the 12,000 historical tickets, labeled into 5 categories:
| Intent | % of Volume | Action |
|---|---|---|
| How-to questions | 42% | Auto-resolve via RAG |
| Bug reports | 18% | Classify → route to engineering |
| Account/billing | 15% | Auto-resolve via RAG + API |
| Feature requests | 12% | Log → acknowledge → route to product |
| Escalation/urgent | 13% | Immediate human routing |
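The routing side of the classifier reduces to a lookup gated by the model's own probability. The label names and the 0.8 probability floor below are illustrative assumptions, not the production values:

```python
# Hypothetical intent labels mirroring the five categories above.
ACTIONS = {
    "how_to": "auto_resolve_rag",
    "bug_report": "route_engineering",
    "account_billing": "auto_resolve_rag_api",
    "feature_request": "log_and_route_product",
    "urgent": "human_escalation",
}

def route(intent: str, prob: float, min_prob: float = 0.8) -> str:
    # Below the classifier's confidence floor, fall back to a human
    # rather than trust a shaky routing decision.
    if prob < min_prob or intent not in ACTIONS:
        return "human_escalation"
    return ACTIONS[intent]
```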
RAG pipeline: For auto-resolvable tickets:
- Query embedding + hybrid search (semantic + keyword) in Qdrant
- Top 5 chunks retrieved and re-ranked using a cross-encoder
- Retrieved context + customer message + system prompt → Claude 3.5 Sonnet
- Response generated with citations (linked to specific help articles)
- Confidence score calculated (if <0.7, route to human instead)
Key decision: The confidence threshold was critical. We'd rather route a ticket to a human than give a wrong answer. Starting at 0.7 and tuning down over time was the right approach.
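The retrieve, re-rank, and confidence-gate steps can be sketched as a single function. The mean-of-top-5 confidence here is a simplification (the production score also folded in an LLM self-assessment), and the Claude call is stubbed as a comment:

```python
from dataclasses import dataclass

@dataclass
class Retrieved:
    text: str
    score: float       # cross-encoder relevance, 0..1
    article_url: str   # for citations in the final answer

def answer_or_escalate(question: str, retrieved: list[Retrieved],
                       threshold: float = 0.7) -> dict:
    top = sorted(retrieved, key=lambda r: r.score, reverse=True)[:5]
    confidence = sum(r.score for r in top) / len(top) if top else 0.0
    if confidence < threshold:
        # Better to hand off than to guess.
        return {"action": "escalate", "confidence": confidence}
    context = "\n\n".join(r.text for r in top)
    citations = [r.article_url for r in top]
    # In production: context + question + system prompt -> Claude 3.5 Sonnet.
    return {"action": "respond", "confidence": confidence,
            "context": context, "citations": citations}
```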
Week 3: Multi-Channel Integration & Testing
Channel integration:
- Email: Connected to Zendesk API — AI responses appear as agent replies
- In-app chat: Embedded widget with real-time AI responses
- Slack: Bot that monitors the support channel and responds in-thread
Testing protocol:
- Fed 500 historical tickets through the system
- Had human agents grade each AI response (correct, partially correct, wrong)
- Measured: accuracy (correct answers), coverage (% of tickets handled), latency (response time)
Week 3 results: 61% auto-resolution, 94% accuracy on resolved tickets, 1.8s average response time.
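Coverage and accuracy fall out of the graded sample directly. A sketch, assuming partially correct answers count against accuracy (which may differ from the client's rubric):

```python
def evaluate(grades):
    # grades[i] is "correct" / "partial" / "wrong" for tickets the AI
    # answered, or None when it routed the ticket to a human.
    answered = [g for g in grades if g is not None]
    coverage = len(answered) / len(grades) if grades else 0.0
    accuracy = answered.count("correct") / len(answered) if answered else 0.0
    return {"coverage": coverage, "accuracy": accuracy}
```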
Clearing our 50%+ target with 61% was encouraging. But we wanted more.
Week 4: Optimization & Production Deployment
Optimizations that moved us from 61% to 73%:
- Added FAQ-style Q&A pairs: We extracted the top 50 questions from historical tickets and created exact-match responses. This alone added 7% coverage.
- Improved chunking for product docs: The product documentation had nested steps that were being split across chunks. We implemented parent-child chunking — small chunks for retrieval, parent chunks for context.
- Conversation history: For chat channels, we included the last 3 messages as context. This handled follow-up questions like "what about the enterprise plan?" without re-explaining.
- Reduced false routing: Tuned the intent classifier to better distinguish between "how do I do X" (auto-resolvable) and "X isn't working" (might need human).
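The parent-child chunking change can be sketched like this: small child chunks do the matching, but the full parent document is what reaches the LLM. The word-overlap scorer is a toy stand-in for the embedding search Qdrant actually performs:

```python
def build_parent_child(docs: dict[str, str], child_size: int = 200):
    """Split each parent doc into small child chunks, keeping a pointer back."""
    children = []  # list of (child_text, parent_id)
    for parent_id, text in docs.items():
        words = text.split()
        for i in range(0, len(words), child_size):
            children.append((" ".join(words[i:i + child_size]), parent_id))
    return children

def retrieve_parent(query: str, children, docs: dict[str, str]) -> str:
    # Toy scorer: shared-word count. Production matched child embeddings
    # in Qdrant instead.
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))
    best_child, parent_id = max(children, key=lambda c: overlap(query, c[0]))
    # The small chunk wins the match; the full parent supplies LLM context.
    return docs[parent_id]
```

This keeps retrieval precise (small chunks) without losing the surrounding steps the answer needs (parent context).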
Production deployment:
- Deployed on AWS using our Cloud CI/CD pipeline
- Set up monitoring dashboards for response quality, latency, and cost
- Created escalation alerts for low-confidence responses
- Documented runbooks for the client's team
Results After 90 Days
| Metric | Before | After | Change |
|---|---|---|---|
| Auto-resolution rate | 0% | 73% | — |
| Average response time | 6 hours | 12 seconds (AI) / 45 min (human) | 99.9% faster for AI |
| CSAT score | 3.4/5 | 4.2/5 | +23% |
| Support team capacity | 200 tickets/day (strained) | 200 tickets/day (comfortable) | Team handles 2x complexity |
| Cost per ticket | $4.20 | $1.10 | -74% |
| Monthly support cost | $24,000 | $8,500 | -65% |
The support team didn't shrink. They shifted from answering "how do I export a CSV?" 40 times a day to handling complex escalations, building better documentation, and doing proactive customer outreach.
Technical Lessons Learned
1. Chunking Strategy Matters More Than Model Choice
We spent 2 days testing GPT-4o vs Claude 3.5 Sonnet vs Llama 3. The accuracy difference was minimal (±2%). But switching from fixed-size to semantic chunking improved accuracy by 12%.
Takeaway: Invest in your data pipeline, not model shopping.
2. Confidence Scoring Saves Trust
The confidence threshold prevented ~15% of tickets from getting wrong answers. Every wrong answer erodes customer trust faster than a slow human response builds it.
We calibrated the threshold by tracking false positive rates weekly and adjusting. This is part of our ongoing monitoring service.
3. Human Escalation Must Be Seamless
When the AI routes to a human, it includes:
- Full conversation history
- Retrieved context chunks (so the human doesn't re-search)
- AI's attempted answer (marked as draft, not sent to customer)
- Confidence score and reason for escalation
This cut human resolution time by 40% even for escalated tickets.
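A minimal shape for that escalation payload, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class EscalationPacket:
    """Everything a human agent needs so they never start from zero."""
    conversation: list[str]      # full message history
    retrieved_chunks: list[str]  # context the AI already found
    draft_answer: str            # AI attempt, never sent to the customer
    confidence: float
    escalation_reason: str

    def summary(self) -> str:
        # One-line header shown at the top of the agent's queue item.
        return (f"confidence={self.confidence:.2f} "
                f"reason={self.escalation_reason} "
                f"chunks={len(self.retrieved_chunks)}")
```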
4. The Knowledge Base Is the Product
The AI is only as good as the knowledge base. We discovered that 30% of help articles were outdated or incomplete. Fixing the content improved AI accuracy more than any technical optimization.
Recommendation: Before building a support AI agent, audit your help center. Remove outdated content, fill gaps, and standardize formatting.
Architecture Deep Dive
For technical readers, here's the full stack:
AI Layer
- Intent classification: Fine-tuned DistilBERT (HuggingFace, hosted on AWS SageMaker)
- Embedding: OpenAI text-embedding-3-large
- Vector DB: Qdrant (self-hosted, single node)
- Re-ranker: Cross-encoder (ms-marco-MiniLM-L-6-v2)
- LLM: Claude 3.5 Sonnet (for response generation)
- Confidence: Custom scoring based on retrieval relevance + LLM self-assessment
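The confidence blend can be sketched as a weighted average of mean reranker relevance and the LLM's own 0–1 self-rating; the 0.6/0.4 weights below are illustrative, not the tuned production values:

```python
def confidence(retrieval_scores: list[float], self_assessment: float,
               w_retrieval: float = 0.6) -> float:
    # Blend retrieval relevance with the LLM's self-rating; an empty
    # retrieval set means we have nothing to stand on.
    if not retrieval_scores:
        return 0.0
    mean_rel = sum(retrieval_scores) / len(retrieval_scores)
    return w_retrieval * mean_rel + (1 - w_retrieval) * self_assessment
```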
Application Layer
- Backend: Python FastAPI
- Queue: Redis for async ticket processing
- Database: PostgreSQL for ticket history, analytics, feedback
- API integrations: Zendesk, Slack, custom chat widget
Infrastructure
- Hosting: AWS ECS (Fargate) for the application, SageMaker for the classifier
- CI/CD: GitHub Actions → ECR → ECS
- Monitoring: Datadog for infrastructure, custom dashboards for AI metrics
- Cost: ~$450/month total infrastructure
Monitoring Stack
- Response quality score (LLM-as-judge, sampled daily)
- Retrieval relevance metrics (MRR, NDCG)
- Latency percentiles (p50, p95, p99)
- Cost per ticket trending
- CSAT correlation with AI vs human responses
- Drift detection (monthly model evaluation)
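For reference, MRR (mean reciprocal rank) rewards retrieval runs where the first relevant chunk appears near the top of the ranking. A minimal implementation:

```python
def mean_reciprocal_rank(results: list[list[bool]]) -> float:
    """results[i] holds ranked relevance flags for query i (True = relevant)."""
    total = 0.0
    for flags in results:
        for rank, hit in enumerate(flags, start=1):
            if hit:
                # Only the first relevant result counts for each query.
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0
```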
Build vs. Buy: Why Custom Won
The client evaluated several off-the-shelf support AI tools (Intercom Fin, Zendesk AI, Ada). Here's why they chose a custom build:
| Factor | Off-the-Shelf | Custom (AIqwip) |
|---|---|---|
| Resolution rate | 30–45% | 73% |
| Customization | Limited | Full control |
| Multi-channel | Usually single | Email + chat + Slack |
| Integration depth | Surface level | Deep (account data, billing API) |
| Cost at scale | $1–$3 per resolution | $0.15 per resolution |
| Brand voice | Generic | Fully customized |
The custom build cost more upfront but paid for itself in 6 weeks through reduced support costs.
Want Similar Results?
We offer a pre-built Customer Support AI Agent that includes:
- Tier-1 auto-resolution with RAG
- Multi-channel support (email, chat, Slack, WhatsApp)
- Human escalation with context
- CSAT tracking and sentiment analysis
- Knowledge base integration
Or, if you need a fully custom solution:
- Book a discovery call: We'll assess your support volume, channels, and knowledge base.
- MVP in 3–4 weeks: Working support AI agent, deployed and handling tickets.
- Optimize over 90 days: Target 60%+ auto-resolution with ongoing monitoring and optimization.
