Email Agent with RL Fine-Tuning: 5x Faster Than GPT-4

Challenge

Executives needed to search years of email history, but generic LLMs were too slow, too expensive, and performed poorly on domain-specific retrieval tasks.

Solution

We RL fine-tuned Qwen 14B using OpenPipe ART and GRPO, creating a specialised email search and Q&A model that outperforms GPT-4 on domain retrieval.

Results

5x faster than GPT-4
60%+ error rate reduction
Beat GPT-4 on domain retrieval
Significant cost reduction

Challenge

An enterprise client needed to transform how their executives accessed historical email communications:

  • Years of Email History: Executives needed to search and query across massive email archives
  • Generic LLM Limitations: Off-the-shelf models like GPT-4 were too slow for production use
  • High Costs: API costs for GPT-4 queries at scale were unsustainable
  • Poor Domain Performance: Generic models struggled with company-specific terminology, people, and context
  • Latency Requirements: Executives expected near-instant responses, not 10+ second waits

The existing solution using GPT-4 was accurate but prohibitively slow and expensive. A cheaper model like GPT-3.5 was faster but made too many errors on domain-specific queries.

Solution

We implemented a reinforcement learning fine-tuning approach to create a specialised email search and Q&A model:

RL Fine-Tuning with OpenPipe ART

Base Model Selection: Qwen 14B chosen for its strong reasoning capabilities and open-source flexibility

Fine-Tuning Approach:

  • OpenPipe ART (Agent Reinforcement Trainer) for automated preference learning
  • GRPO (Group Relative Policy Optimization) for stable RL training
  • LoRA adapters for efficient fine-tuning without full model retraining (see the training sketch after this list)
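
The case study doesn't include the ART training code itself, so the following is a minimal sketch of the same recipe (GRPO with LoRA adapters on Qwen 14B) using Hugging Face TRL's GRPOTrainer as a stand-in; the dataset file and reward function are placeholders, not the client's actual assets.

# Minimal GRPO + LoRA sketch using Hugging Face TRL as a stand-in for
# OpenPipe ART; dataset path and reward function are illustrative placeholders.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def retrieval_reward(completions, **kwargs):
    # Placeholder: score each completion for retrieval quality
    # (see the reward-function sketch later in this article).
    return [0.0 for _ in completions]

# GRPO in TRL expects records with a "prompt" column
dataset = load_dataset("json", data_files="email_queries.jsonl", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-14B",
    reward_funcs=retrieval_reward,
    args=GRPOConfig(
        output_dir="qwen14b-email-grpo",
        learning_rate=1e-5,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        warmup_ratio=0.1,
    ),
    train_dataset=dataset,
    peft_config=LoraConfig(
        r=64,
        lora_alpha=128,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    ),
)
trainer.train()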

Training Pipeline

  1. Data Collection: Curated examples of email search queries with correct and incorrect retrievals (an illustrative record follows this list)
  2. Reward Modelling: Trained reward model on human preferences for retrieval quality
  3. RL Training: GRPO optimisation to maximise retrieval accuracy while maintaining coherence
  4. Evaluation: Continuous benchmarking against GPT-4 on domain-specific test set
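
For illustration, a single curated record for steps 1-2 might look like the dictionary below; the field names and contents are assumptions, not the client's actual schema.

# Illustrative preference pair for reward modelling
# (field names and contents are assumptions, not the actual schema)
preference_pair = {
    "query": "What did we agree with the vendor about the Q3 renewal?",
    "chosen": {
        "email_ids": ["msg_18234", "msg_18240"],
        "answer": "The renewal terms agreed in the August thread ...",
    },
    "rejected": {
        "email_ids": ["msg_09112"],
        "answer": "An unrelated thread about the Q3 offsite.",
    },
}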

Model Specialisation

The fine-tuned model learned:

  • Company-specific terminology and acronyms
  • People names and organisational relationships
  • Project codes and internal references
  • Email threading and conversation context
  • Date/time reasoning for “last month” or “Q3 2023” queries

Results

The fine-tuned model dramatically outperformed both generic alternatives:

Performance Comparison

Metric                       GPT-4      GPT-3.5    Our Fine-Tuned Model
Query Latency                8-12 sec   2-3 sec    1.5-2 sec
Domain Retrieval Accuracy    78%        52%        91%
Error Rate                   22%        48%        Under 9%
Cost per 1K queries          $15+       $0.50      $0.10

Key Achievements

  • 5x Faster Than GPT-4: Sub-2-second response times vs 8-12 seconds
  • 60%+ Error Rate Reduction: From 22% errors (GPT-4) to under 9%
  • Beat GPT-4 on Domain Retrieval: 91% accuracy vs 78% for GPT-4
  • 150x Cost Reduction: $0.10 per 1K queries vs $15+ for GPT-4

Business Impact

  • Executives now get instant answers from email archives
  • Search abandonment rates dropped 70%
  • Monthly API costs reduced from $45K to $300
  • Model runs on company infrastructure—no data leaves premises

Technical Details

RL Training Configuration

# GRPO training configuration
training_config = {
    "base_model": "Qwen/Qwen2.5-14B",             # open-weight base with strong reasoning
    "method": "grpo",                              # Group Relative Policy Optimization
    "reward_model": "custom_email_retrieval_rm",   # trained on human retrieval preferences
    "lora_config": {
        "r": 64,                                   # adapter rank
        "lora_alpha": 128,                         # scaling factor (2x rank)
        "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],  # attention projections
        "lora_dropout": 0.05
    },
    "training_args": {
        "learning_rate": 1e-5,
        "batch_size": 4,
        "gradient_accumulation_steps": 8,          # effective batch size of 32
        "num_epochs": 3,
        "warmup_ratio": 0.1
    }
}
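
The config references a custom reward model ("custom_email_retrieval_rm") that isn't shown in the case study. As a deliberately simplified stand-in, a rule-based reward for retrieval quality could look like this; the real reward model was trained on human preferences, and the weights and penalties here are assumptions.

# Simplified rule-based stand-in for the trained reward model;
# weights and penalties are illustrative assumptions.
def email_retrieval_reward(retrieved_ids, gold_ids, grounded, hallucinated):
    gold = set(gold_ids)
    hits = len(gold & set(retrieved_ids))
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(gold) if gold else 0.0
    reward = 0.5 * precision + 0.5 * recall
    if hallucinated:        # fabricated emails are penalised hard
        reward -= 1.0
    if grounded and hits:   # small bonus when the answer cites real hits
        reward += 0.1
    return reward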

Why GRPO Over PPO

We chose GRPO (Group Relative Policy Optimization) over traditional PPO for several reasons (a sketch of the core advantage computation follows the list):

  1. Stability: GRPO provides more stable training on smaller datasets
  2. Sample Efficiency: Requires fewer examples to achieve strong results
  3. Compute Efficiency: Lower memory footprint than full PPO
  4. Quality: Better alignment with human preferences on retrieval tasks
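
The core of these properties is that GRPO replaces PPO's learned value network with a group baseline: several completions are sampled per prompt and each reward is normalised against the group's own statistics, as in this short sketch.

# Group-relative advantages: the baseline comes from the group itself,
# so no separate value network is needed (a key reason for GRPO's lower
# memory footprint versus PPO).
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    rewards = np.asarray(rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. four sampled completions for the same email query:
print(group_relative_advantages([0.9, 0.4, 0.4, 0.1]))
# -> approximately [ 1.57, -0.17, -0.17, -1.22]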

Evaluation Framework

Continuous evaluation ran throughout training (a minimal harness sketch follows the checklist):

  • Retrieval Accuracy: Does the model find the right emails?
  • Answer Quality: Are answers factually correct and complete?
  • Latency: Response time under 2 seconds?
  • Coherence: Are responses well-formed and professional?
  • Safety: No hallucinated emails or fabricated content?
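
A minimal harness for the retrieval-accuracy and latency checks might look like the following; the model_answer callable and test-set fields are assumptions.

# Minimal evaluation loop for retrieval accuracy and latency;
# model_answer() and the test-set fields are assumptions.
import time

def evaluate(test_set, model_answer, latency_budget_s=2.0):
    correct = within_budget = 0
    for case in test_set:
        start = time.perf_counter()
        retrieved_ids, _answer = model_answer(case["query"])
        elapsed = time.perf_counter() - start
        if set(case["gold_email_ids"]) <= set(retrieved_ids):
            correct += 1
        if elapsed <= latency_budget_s:
            within_budget += 1
    n = len(test_set)
    return {
        "retrieval_accuracy": correct / n,
        "latency_pass_rate": within_budget / n,
    }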

Infrastructure

  • Training: 4x A100 80GB GPUs, 2 weeks training time
  • Inference: Single A100 40GB for production serving
  • Deployment: On-premise, air-gapped environment
  • Integration: REST API compatible with existing email search UI (example request below)
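
For a sense of the integration surface, a client call against the on-prem endpoint could look like this; the host, route, and payload fields are hypothetical, not the actual API.

# Hypothetical client call to the on-prem inference endpoint;
# host, route, and payload fields are assumptions, not the actual API.
import requests

resp = requests.post(
    "http://email-search.internal:8000/v1/query",
    json={"query": "Summarise the latest renewal thread with the vendor"},
    timeout=5,
)
resp.raise_for_status()
print(resp.json())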

Key Insights

Why RL Fine-Tuning?

Traditional supervised fine-tuning (SFT) wasn’t sufficient because:

  1. Retrieval is Nuanced: “Correct” retrieval isn’t binary—some results are better than others
  2. Preference Learning: RL captures the subtle preferences humans have for retrieval quality
  3. Exploration: RL allows the model to discover better retrieval strategies
  4. Alignment: GRPO specifically optimises for the outcomes users care about

Lessons Learned

  1. Data Quality > Quantity: 5,000 high-quality preference pairs outperformed 50,000 noisy examples
  2. Domain Expertise Matters: Our reward model needed to understand email-specific success criteria
  3. Evaluation is Critical: Continuous benchmarking caught regression early
  4. Start with Strong Base: Qwen 14B’s reasoning capabilities made fine-tuning more effective

Project Details

  • Duration: 3 months from kickoff to production
  • Team: 2 ML engineers
  • Training Data: 5,000 curated preference pairs
  • Model Size: 14B parameters (LoRA adapters ~500MB)
  • Deployment: On-premise, single GPU inference

Want to fine-tune models for your domain-specific use case? Contact us to explore how RL fine-tuning can give you GPT-4 quality at a fraction of the cost.

Technologies Used

Qwen 14B · OpenPipe ART · GRPO · LoRA

Timeline

3 months delivery

Ready to achieve similar results?

Let's discuss how we can help your business succeed with AI.