Blog · April 2026 · 12 min read

How We Built a Customer Support AI Agent That Resolves 73% of Tickets

A detailed case study of how we built a customer support AI agent for a B2B SaaS company — architecture decisions, RAG pipeline design, and the path to 73% auto-resolution.

In Q4 2024, a Series A B2B SaaS company came to us with a problem: their 4-person support team was drowning in 200+ tickets per day. Response times had ballooned to 6 hours. Customer satisfaction was dropping.

They didn't need more support reps. They needed an AI agent that could handle the repetitive tier-1 tickets — instantly, accurately, and with the right brand voice.

Here's exactly how we built it, the architectural decisions we made, and what we'd do differently.



The Brief

Client: A B2B SaaS platform for mid-market companies (identity withheld under NDA)

Industry: Operations Management

Support volume: 200–250 tickets/day across email, in-app chat, and Slack

Team size: 4 support agents

Goal: Automate 50%+ of tier-1 tickets while maintaining 90%+ CSAT

Budget: $8,000 for MVP, $2,000/month ongoing

Timeline: 4 weeks to production



The Architecture Decision


Why RAG, Not Fine-Tuning


We considered both approaches (read our full RAG vs fine-tuning comparison) and chose RAG for three reasons:

  1. Their help center changes weekly: New features, updated workflows, policy changes. Fine-tuning would require retraining every time content changed. RAG pulls the latest version automatically.
  2. Source attribution was required: The client needed the AI to cite specific help articles so customers could read more. RAG provides this natively.
  3. Speed to production: RAG MVP in 3 weeks. Fine-tuning would have taken 8–10 weeks and they were losing customers now.

System Architecture

Customer Message (email/chat/Slack)
    ↓
Intent Classifier (fine-tuned DistilBERT)
    ↓
┌─────────────────────────────────┐
│ Tier-1 (automatable)            │ → RAG Pipeline → AI Response → Customer
│                                 │
│ Tier-2 (needs human)            │ → Route to human agent with AI-generated context
│                                 │
│ Urgent/Sensitive                │ → Immediate escalation + alert
└─────────────────────────────────┘
    ↓
Feedback Loop → Quality Monitoring → Optimization

We used a hybrid approach: a small fine-tuned classifier for intent routing, and RAG for answer generation. This gave us the speed of classification with the accuracy and flexibility of retrieval.



Week-by-Week Build


Week 1: Data Pipeline & RAG Foundation


The first challenge was data.


The client had:

  • 340 help center articles (in Zendesk)
  • 12,000 historical support tickets with agent responses
  • Internal runbooks and SOPs (Google Docs)
  • Slack threads with edge case solutions

We built a data ingestion pipeline that:

  1. Extracted content from Zendesk, Google Docs, and Slack via APIs
  2. Cleaned and normalized text (removed HTML, fixed formatting, resolved abbreviations)
  3. Chunked documents using semantic chunking (not fixed-size) for better retrieval
  4. Embedded chunks using OpenAI's text-embedding-3-large
  5. Stored vectors in Qdrant (chosen for hybrid search support)


Key decision:
We used semantic chunking instead of fixed 512-token chunks. This improved retrieval relevance by 23% in our initial tests because each chunk represented a complete concept, not an arbitrary text split.
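To make the idea concrete, here is a minimal sketch of similarity-based chunking. It is not our production pipeline: it uses word-overlap (Jaccard) similarity as a cheap stand-in for embedding cosine similarity, and the `threshold` and `max_sentences` values are illustrative. The principle is the same: a new chunk starts where topical similarity drops, so each chunk holds one complete concept.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in for embedding cosine similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def semantic_chunks(sentences, threshold=0.2, max_sentences=5):
    """Group consecutive sentences; start a new chunk when similarity to the
    chunk built so far drops below the threshold (or the chunk gets too long)."""
    chunks, current = [], []
    for sent in sentences:
        if current and (jaccard(" ".join(current), sent) < threshold
                        or len(current) >= max_sentences):
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

In production you would swap `jaccard` for cosine similarity over sentence embeddings; the boundary logic stays the same.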


Week 2: Intent Classification & RAG Pipeline


Intent classifier:
We fine-tuned DistilBERT on the 12,000 historical tickets, labeled into 5 categories:

| Intent | % of Volume | Action |
|---|---|---|
| How-to questions | 42% | Auto-resolve via RAG |
| Bug reports | 18% | Classify → route to engineering |
| Account/billing | 15% | Auto-resolve via RAG + API |
| Feature requests | 12% | Log → acknowledge → route to product |
| Escalation/urgent | 13% | Immediate human routing |
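The routing logic downstream of the classifier can be sketched as a simple lookup plus a confidence gate. The intent labels and action names below are illustrative (our labels were specific to the client's taxonomy); the structure mirrors the table above, with urgent tickets and low-confidence classifications always going to a human.

```python
# Hypothetical label -> action mapping mirroring the intent table above.
INTENT_ACTIONS = {
    "how_to": "auto_resolve_rag",
    "bug_report": "route_engineering",
    "account_billing": "auto_resolve_rag_api",
    "feature_request": "log_and_route_product",
    "urgent": "human_escalation",
}

def route(intent: str, confidence: float, threshold: float = 0.7) -> str:
    """Route a classified ticket. Urgent tickets and low-confidence
    classifications always go to a human, regardless of intent."""
    if intent == "urgent" or confidence < threshold:
        return "human_escalation"
    return INTENT_ACTIONS.get(intent, "human_escalation")
```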

RAG pipeline: For auto-resolvable tickets:

  1. Query embedding + hybrid search (semantic + keyword) in Qdrant
  2. Top 5 chunks retrieved and re-ranked using a cross-encoder
  3. Retrieved context + customer message + system prompt → Claude 3.5 Sonnet
  4. Response generated with citations (linked to specific help articles)
  5. Confidence score calculated (if <0.7, route to human instead)

Key decision: The confidence threshold was critical. We'd rather route a ticket to a human than give a wrong answer. Starting at 0.7 and tuning down over time was the right approach.

Week 3: Multi-Channel Integration & Testing


Channel integration:

  • Email: Connected to Zendesk API — AI responses appear as agent replies
  • In-app chat: Embedded widget with real-time AI responses
  • Slack: Bot that monitors the support channel and responds in-thread

Testing protocol:

  • Fed 500 historical tickets through the system
  • Had human agents grade each AI response (correct, partially correct, wrong)
  • Measured: accuracy (correct answers), coverage (% of tickets handled), latency (response time)

Week 3 results: 61% auto-resolution, 94% accuracy on resolved tickets, 1.8s average response time.

At 61%, we had already cleared our 50%+ target in week 3. But we wanted more.

Week 4: Optimization & Production Deployment


Optimizations that moved us from 61% to 73%:

  1. Added FAQ-style Q&A pairs: We extracted the top 50 questions from historical tickets and created exact-match responses. This alone added 7% coverage.
  2. Improved chunking for product docs: The product documentation had nested steps that were being split across chunks. We implemented parent-child chunking — small chunks for retrieval, parent chunks for context.
  3. Conversation history: For chat channels, we included the last 3 messages as context. This handled follow-up questions like "what about the enterprise plan?" without re-explaining.
  4. Reduced false routing: Tuned the intent classifier to better distinguish between "how do I do X" (auto-resolvable) and "X isn't working" (might need human).
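Parent-child chunking (optimization 2) is worth illustrating. The sketch below uses fixed character offsets for brevity (our real splitter worked on the docs' heading/step structure, and the sizes are illustrative): small child chunks are what get embedded and searched, and each child carries a pointer back to its larger parent, which is what actually goes to the LLM as context.

```python
def parent_child_chunks(doc_id, text, child_size=200, parent_size=1000):
    """Split a document into large parent chunks, then split each parent into
    small child chunks. Children are indexed for retrieval; on a hit, the
    parent (not the child) is sent to the LLM so nested steps stay intact."""
    parents = [text[i:i + parent_size] for i in range(0, len(text), parent_size)]
    index = []
    for p_idx, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            index.append({
                "doc_id": doc_id,
                "parent_id": p_idx,           # pointer back to full context
                "child_text": parent[j:j + child_size],  # what gets embedded
            })
    return parents, index
```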


Production deployment:

  • Deployed on AWS using our Cloud CI/CD pipeline
  • Set up monitoring dashboards for response quality, latency, and cost
  • Created escalation alerts for low-confidence responses
  • Documented runbooks for the client's team



Results After 90 Days

| Metric | Before | After | Change |
|---|---|---|---|
| Auto-resolution rate | 0% | 73% | — |
| Average response time | 6 hours | 12 seconds (AI) / 45 min (human) | 99.9% faster for AI |
| CSAT score | 3.4/5 | 4.2/5 | +23% |
| Support team capacity | 200 tickets/day (strained) | 200 tickets/day (comfortable) | Team handles 2x complexity |
| Cost per ticket | $4.20 | $1.10 | -74% |
| Monthly support cost | $24,000 | $8,500 | -65% |

The support team didn't shrink. They shifted from answering "how do I export a CSV?" 40 times a day to handling complex escalations, building better documentation, and doing proactive customer outreach.



Technical Lessons Learned

1. Chunking Strategy Matters More Than Model Choice


We spent 2 days testing GPT-4o vs Claude 3.5 Sonnet vs Llama 3. The accuracy difference was minimal (±2%). But switching from fixed-size to semantic chunking improved accuracy by 12%.

Takeaway: Invest in your data pipeline, not model shopping.

2. Confidence Scoring Saves Trust


The confidence threshold prevented ~15% of tickets from getting wrong answers. Every wrong answer erodes customer trust faster than a slow human response builds it.

We calibrated the threshold by tracking false positive rates weekly and adjusting. This is part of our ongoing monitoring service.

3. Human Escalation Must Be Seamless


When the AI routes to a human, it includes:

  • Full conversation history
  • Retrieved context chunks (so the human doesn't re-search)
  • AI's attempted answer (marked as draft, not sent to customer)
  • Confidence score and reason for escalation
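The handoff payload above maps naturally onto a small data structure. Field and function names here are illustrative, not our production schema; the point is that everything the human needs travels with the ticket in one object.

```python
from dataclasses import dataclass

@dataclass
class EscalationPacket:
    """Everything a human agent needs when the AI hands off a ticket."""
    ticket_id: str
    conversation: list       # full message history
    retrieved_chunks: list   # context already found, so the human doesn't re-search
    draft_answer: str        # AI's attempt, marked as draft, never sent to customer
    confidence: float
    escalation_reason: str

def build_packet(ticket_id, history, chunks, draft, confidence, threshold=0.7):
    """Package an escalation, recording why the AI stepped aside."""
    reason = "low_confidence" if confidence < threshold else "policy_escalation"
    return EscalationPacket(ticket_id, history, chunks, draft, confidence, reason)
```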

This cut human resolution time by 40% even for escalated tickets.

4. The Knowledge Base Is the Product

The AI is only as good as the knowledge base. We discovered that 30% of help articles were outdated or incomplete. Fixing the content improved AI accuracy more than any technical optimization.

Recommendation: Before building a support AI agent, audit your help center. Remove outdated content, fill gaps, and standardize formatting.



Architecture Deep Dive


For technical readers, here's the full stack:

AI Layer

  • Intent classification: Fine-tuned DistilBERT (HuggingFace, hosted on AWS SageMaker)
  • Embedding: OpenAI text-embedding-3-large
  • Vector DB: Qdrant (self-hosted, single node)
  • Re-ranker: Cross-encoder (ms-marco-MiniLM-L-6-v2)
  • LLM: Claude 3.5 Sonnet (for response generation)
  • Confidence: Custom scoring based on retrieval relevance + LLM self-assessment

Application Layer

  • Backend: Python FastAPI
  • Queue: Redis for async ticket processing
  • Database: PostgreSQL for ticket history, analytics, feedback
  • API integrations: Zendesk, Slack, custom chat widget

Infrastructure

  • Hosting: AWS ECS (Fargate) for the application, SageMaker for the classifier
  • CI/CD: GitHub Actions → ECR → ECS
  • Monitoring: Datadog for infrastructure, custom dashboards for AI metrics
  • Cost: ~$450/month total infrastructure

Monitoring Stack

  • Response quality score (LLM-as-judge, sampled daily)
  • Retrieval relevance metrics (MRR, NDCG)
  • Latency percentiles (p50, p95, p99)
  • Cost per ticket trending
  • CSAT correlation with AI vs human responses
  • Drift detection (monthly model evaluation)
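Of the retrieval metrics above, MRR (mean reciprocal rank) is the simplest to compute: for each query, take the reciprocal of the rank of the first relevant chunk, then average across queries. A minimal implementation, assuming each retrieval result is represented as an ordered list of relevance flags:

```python
def mean_reciprocal_rank(results):
    """MRR over retrieval results.

    results: list of queries, each an ordered list of booleans marking whether
    the chunk at that rank was relevant. A query with no relevant chunk
    contributes 0.
    """
    total = 0.0
    for ranked in results:
        for rank, relevant in enumerate(ranked, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0
```

An MRR near 1.0 means the best chunk is almost always retrieved first; a falling MRR is an early warning that chunking or embeddings have drifted from the content.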



Build vs. Buy: Why Custom Won


The client evaluated several off-the-shelf support AI tools (Intercom Fin, Zendesk AI, Ada). Here's why they chose a custom build:

| Factor | Off-the-Shelf | Custom (Aiqwip) |
|---|---|---|
| Resolution rate | 30–45% | 73% |
| Customization | Limited | Full control |
| Multi-channel | Usually single channel | Email + chat + Slack |
| Integration depth | Surface level | Deep (account data, billing API) |
| Cost at scale | $1–$3 per resolution | $0.15 per resolution |
| Brand voice | Generic | Fully customized |

The custom build cost more upfront but paid for itself in 6 weeks through reduced support costs.



Want Similar Results?


We offer a pre-built Customer Support AI Agent that includes:

  • Tier-1 auto-resolution with RAG
  • Multi-channel support (email, chat, Slack, WhatsApp)
  • Human escalation with context
  • CSAT tracking and sentiment analysis
  • Knowledge base integration

Or, if you need a fully custom solution:

  1. Book a discovery call: We'll assess your support volume, channels, and knowledge base.
  2. MVP in 3–4 weeks: Working support AI agent, deployed and handling tickets.
  3. Optimize over 90 days: Target 60%+ auto-resolution with ongoing monitoring and optimization.


