Challenge
An enterprise client needed to transform how their executives accessed historical email communications:
- Years of Email History: Executives needed to search and query across massive email archives
- Generic LLM Limitations: Off-the-shelf models like GPT-4 were too slow for production use
- High Costs: API costs for GPT-4 queries at scale were unsustainable
- Poor Domain Performance: Generic models struggled with company-specific terminology, people, and context
- Latency Requirements: Executives expected near-instant responses, not 10+ second waits
The existing solution using GPT-4 was accurate but prohibitively slow and expensive. A cheaper model like GPT-3.5 was faster but made too many errors on domain-specific queries.
Solution
We implemented a reinforcement learning fine-tuning approach to create a specialised email search and Q&A model:
RL Fine-Tuning with OpenPipe ART
Base Model Selection: Qwen 2.5 14B was chosen for its strong reasoning capabilities and open-source flexibility
Fine-Tuning Approach:
- OpenPipe ART (Agent Reinforcement Trainer) for automated preference learning
- GRPO (Group Relative Policy Optimization) for stable RL training
- LoRA adapters for efficient fine-tuning without full model retraining (a PEFT-style setup is sketched below)
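As a hedged illustration of the adapter setup, the sketch below attaches LoRA adapters of the shape described under Technical Details using Hugging Face PEFT. This is a stand-in for clarity, not the exact training script from the ART pipeline.

```python
# Minimal sketch: attaching LoRA adapters to the base model with Hugging Face PEFT.
# Hyperparameters mirror the GRPO training configuration shown later in this case study;
# this is illustrative, not the production training code.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B")

lora_config = LoraConfig(
    r=64,                      # adapter rank
    lora_alpha=128,            # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights (~500MB) are trained
```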
Training Pipeline
- Data Collection: Curated examples of email search queries with correct and incorrect retrievals (an illustrative pair is sketched after this list)
- Reward Modelling: Trained a reward model on human preferences for retrieval quality
- RL Training: GRPO optimisation to maximise retrieval accuracy while maintaining coherence
- Evaluation: Continuous benchmarking against GPT-4 on domain-specific test set
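As a rough illustration of step 1, a curated preference pair couples one query with a preferred and a rejected retrieval. The field names and contents below are hypothetical, not the client's actual schema or data.

```python
# Hypothetical example of a curated preference pair; field names and contents are
# illustrative assumptions, not the actual training schema used in the project.
preference_pair = {
    "query": "Find the thread where legal signed off on the vendor contract",
    "chosen": {
        "email_ids": ["msg_48121", "msg_48133"],
        "answer": "Legal approved the vendor contract; see the thread between "
                  "the project lead and the General Counsel.",
    },
    "rejected": {
        "email_ids": ["msg_11872"],
        "answer": "A contract was discussed in an unrelated procurement thread.",
    },
}
```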
Model Specialisation
The fine-tuned model learned:
- Company-specific terminology and acronyms
- People names and organisational relationships
- Project codes and internal references
- Email threading and conversation context
- Date/time reasoning for “last month” or “Q3 2023” queries (a rough illustration follows this list)
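To make the last point concrete, here is a sketch of the kind of relative-date resolution the model had to internalise. It illustrates the target behaviour only; it is not code from the production system.

```python
# Illustration only: the kind of date-range resolution the fine-tuned model learned
# to perform implicitly for queries like "last month" or "Q3 2023".
from datetime import date, timedelta

def resolve_date_range(expression: str, today: date) -> tuple[date, date]:
    expression = expression.strip().lower()
    if expression == "last month":
        first_of_this_month = today.replace(day=1)
        last_month_end = first_of_this_month - timedelta(days=1)
        return last_month_end.replace(day=1), last_month_end
    if expression.startswith("q") and len(expression.split()) == 2:
        quarter, year = expression.split()
        q = int(quarter[1])
        start = date(int(year), 3 * (q - 1) + 1, 1)
        end = date(int(year) + (q == 4), (3 * q) % 12 + 1, 1) - timedelta(days=1)
        return start, end
    raise ValueError(f"Unrecognised date expression: {expression}")

print(resolve_date_range("Q3 2023", today=date(2024, 1, 15)))
# (datetime.date(2023, 7, 1), datetime.date(2023, 9, 30))
```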
Results
The fine-tuned model dramatically outperformed both generic alternatives:
Performance Comparison
| Metric | GPT-4 | GPT-3.5 | Our Fine-Tuned Model |
|---|---|---|---|
| Query Latency | 8-12 sec | 2-3 sec | 1.5-2 sec |
| Domain Retrieval Accuracy | 78% | 52% | 91% |
| Error Rate | 22% | 48% | Under 9% |
| Cost per 1K queries | $15+ | $0.50 | $0.10 |
Key Achievements
- 5x Faster Than GPT-4: Sub-2-second response times vs 8-12 seconds
- 60%+ Error Rate Reduction: From 22% errors (GPT-4) to under 9%
- Beat GPT-4 on Domain Retrieval: 91% accuracy vs 78% for GPT-4
- 150x Cost Reduction: $0.10 per 1K queries vs $15+ for GPT-4
Business Impact
- Executives now get instant answers from email archives
- Search abandonment rates dropped by 70%
- Monthly API costs reduced from $45K to $300
- Model runs on company infrastructure—no data leaves premises
Technical Details
RL Training Configuration
```python
# GRPO training configuration
training_config = {
    "base_model": "Qwen/Qwen2.5-14B",
    "method": "grpo",
    "reward_model": "custom_email_retrieval_rm",
    "lora_config": {
        "r": 64,
        "lora_alpha": 128,
        "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],
        "lora_dropout": 0.05
    },
    "training_args": {
        "learning_rate": 1e-5,
        "batch_size": 4,
        "gradient_accumulation_steps": 8,
        "num_epochs": 3,
        "warmup_ratio": 0.1
    }
}
```
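The project itself ran on OpenPipe ART with a custom email-retrieval reward model. Purely as an illustration of how a configuration like this maps onto an open-source GRPO implementation, the sketch below wires equivalent settings into TRL's GRPOTrainer with a simplified placeholder reward; the dataset file and its columns are assumptions.

```python
# Illustrative only: wiring the configuration above into TRL's GRPOTrainer.
# The real project used OpenPipe ART and a learned reward model; the reward
# function, dataset file, and column names here are simplified stand-ins.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def retrieval_reward(completions, reference_email_ids, **kwargs):
    """Placeholder reward: fraction of reference email IDs mentioned in each completion."""
    rewards = []
    for completion, refs in zip(completions, reference_email_ids):
        hits = sum(ref in completion for ref in refs)
        rewards.append(hits / max(len(refs), 1))
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-14B",
    reward_funcs=retrieval_reward,
    args=GRPOConfig(
        output_dir="email-search-grpo",
        learning_rate=1e-5,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        warmup_ratio=0.1,
    ),
    # Assumed JSONL with "prompt" and "reference_email_ids" columns.
    train_dataset=load_dataset("json", data_files="email_queries.jsonl")["train"],
    peft_config=LoraConfig(r=64, lora_alpha=128, lora_dropout=0.05,
                           target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]),
)
trainer.train()
```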
Why GRPO Over PPO
We chose GRPO (Group Relative Policy Optimization) over traditional PPO for several reasons (the group-relative advantage step at its core is sketched after this list):
- Stability: GRPO provides more stable training on smaller datasets
- Sample Efficiency: Requires fewer examples to achieve strong results
- Compute Efficiency: Lower memory footprint than full PPO
- Quality: Better alignment with human preferences on retrieval tasks
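The "group relative" part is what removes the need for a learned value critic: for each prompt, a group of candidate completions is sampled and each completion's advantage is its reward standardised against the rest of the group. A minimal sketch of that computation (the full algorithm also includes the clipped policy objective and KL regularisation):

```python
# Minimal sketch of GRPO's group-relative advantage: each sampled completion for a
# prompt is scored, and its advantage is the reward standardised within that group.
import numpy as np

def group_relative_advantages(group_rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    rewards = np.asarray(group_rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four candidate answers to the same email query, scored by the reward model:
print(group_relative_advantages([0.9, 0.4, 0.4, 0.1]))
# Completions above the group mean get a positive advantage, below it a negative one.
```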
Evaluation Framework
Continuous evaluation throughout training covered the following checks (a simplified harness is sketched after this list):
- Retrieval Accuracy: Does the model find the right emails?
- Answer Quality: Are answers factually correct and complete?
- Latency: Response time under 2 seconds?
- Coherence: Are responses well-formed and professional?
- Safety: No hallucinated emails or fabricated content?
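A hedged sketch of what such a continuous evaluation loop can look like; the query function, test-set format, and thresholds below are assumptions for illustration, not the client's actual benchmarking harness.

```python
# Illustrative evaluation loop: the query function, test-set format, and latency
# budget are assumptions, not the production benchmarking harness.
import time

LATENCY_BUDGET_SECONDS = 2.0

def evaluate(model_query_fn, test_cases):
    """test_cases: list of dicts with 'query' and 'expected_email_ids'."""
    correct_retrievals, within_budget = 0, 0
    for case in test_cases:
        start = time.perf_counter()
        result = model_query_fn(case["query"])  # assumed to return {'email_ids': [...], 'answer': str}
        latency = time.perf_counter() - start

        if set(case["expected_email_ids"]) <= set(result["email_ids"]):
            correct_retrievals += 1
        if latency <= LATENCY_BUDGET_SECONDS:
            within_budget += 1

    n = len(test_cases)
    # Answer quality, coherence, and safety checks are typically scored separately
    # (e.g. by reviewers or a judge model) and are omitted from this sketch.
    return {
        "retrieval_accuracy": correct_retrievals / n,
        "latency_within_budget": within_budget / n,
    }
```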
Infrastructure
- Training: 4x A100 80GB GPUs, 2 weeks training time
- Inference: Single A100 40GB for production serving
- Deployment: On-premise, air-gapped environment
- Integration: REST API compatible with the existing email search UI (a client sketch follows this list)
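For a sense of the integration surface, here is a minimal client sketch against a hypothetical on-premise endpoint. The URL, route, and payload fields are assumptions; the actual API follows the contract of the client's existing email search UI.

```python
# Hypothetical client call to the on-premise inference API; the URL, route,
# and payload fields are illustrative assumptions.
import requests

response = requests.post(
    "https://email-search.internal.example/api/v1/query",
    json={
        "query": "Summarise the vendor contract emails from Q3 2023",
        "max_results": 5,
    },
    timeout=5,
)
response.raise_for_status()
print(response.json())
```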
Key Insights
Why RL Fine-Tuning?
Traditional supervised fine-tuning (SFT) wasn’t sufficient because:
- Retrieval is Nuanced: “Correct” retrieval isn’t binary; some results are better than others (a toy graded reward is sketched after this list)
- Preference Learning: RL captures the subtle preferences humans have for retrieval quality
- Exploration: RL allows the model to discover better retrieval strategies
- Alignment: GRPO specifically optimises for the outcomes users care about
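To make the first point concrete, here is a toy sketch of a graded (non-binary) retrieval score of the kind a reward model learns to approximate. The weighting and penalty are illustrative assumptions, not the project's learned reward model.

```python
# Toy illustration of a graded retrieval reward: partial credit for overlapping
# emails, a small penalty for irrelevant ones. Weights are assumptions.
def graded_retrieval_reward(retrieved_ids, relevant_ids, irrelevant_penalty=0.2):
    if not relevant_ids:
        return 0.0
    recall = len(set(retrieved_ids) & set(relevant_ids)) / len(set(relevant_ids))
    noise = len(set(retrieved_ids) - set(relevant_ids)) * irrelevant_penalty
    return max(recall - noise, 0.0)

print(graded_retrieval_reward(["a", "b", "x"], ["a", "b", "c"]))  # ~0.47: partial credit
```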
Lessons Learned
- Data Quality > Quantity: 5,000 high-quality preference pairs outperformed 50,000 noisy examples
- Domain Expertise Matters: Our reward model needed to understand email-specific success criteria
- Evaluation is Critical: Continuous benchmarking caught regression early
- Start with Strong Base: Qwen 14B’s reasoning capabilities made fine-tuning more effective
Project Details
- Duration: 3 months from kickoff to production
- Team: 2 ML engineers
- Training Data: 5,000 curated preference pairs
- Model Size: 14B parameters (LoRA adapters ~500MB)
- Deployment: On-premise, single GPU inference
Want to fine-tune models for your domain-specific use case? Contact us to explore how RL fine-tuning can give you GPT-4 quality at a fraction of the cost.