Challenge
An enterprise client needed to transform how their executives accessed historical email communications:
- Years of Email History: Executives needed to search and query across massive email archives
- Generic LLM Limitations: Off-the-shelf models like GPT-4 were too slow for production use
- High Costs: API costs for GPT-4 queries at scale were unsustainable
- Poor Domain Performance: Generic models struggled with company-specific terminology, people, and context
- Latency Requirements: Executives expected near-instant responses, not 10+ second waits
The existing solution using GPT-4 was accurate but prohibitively slow and expensive. A cheaper model like GPT-3.5 was faster but made too many errors on domain-specific queries.
Solution
We implemented a reinforcement learning fine-tuning approach to create a specialised email search and Q&A model:
RL Fine-Tuning with OpenPipe ART
Base Model Selection: Qwen 2.5 14B was chosen for its strong reasoning capabilities and open-source flexibility
Fine-Tuning Approach:
- OpenPipe ART (Agent Reinforcement Trainer) for automated preference learning
- GRPO (Group Relative Policy Optimization) for stable RL training
- LoRA adapters for efficient fine-tuning without full model retraining (a PEFT-style setup is sketched below)
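As a hedged illustration of the adapter setup, the sketch below attaches LoRA adapters of the shape described under Technical Details using Hugging Face PEFT. This is a stand-in for clarity, not the exact training script from the ART pipeline.

```python
# Minimal sketch: attaching LoRA adapters to the base model with Hugging Face PEFT.
# Hyperparameters mirror the GRPO training configuration shown later in this case study;
# this is illustrative, not the production training code.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B")

lora_config = LoraConfig(
    r=64,                      # adapter rank
    lora_alpha=128,            # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights (~500MB) are trained
```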
Training Pipeline
- Data Collection: Curated examples of email search queries with correct and incorrect retrievals (an illustrative pair is sketched after this list)
- Reward Modelling: Trained a reward model on human preferences for retrieval quality
- RL Training: GRPO optimisation to maximise retrieval accuracy while maintaining coherence
- Evaluation: Continuous benchmarking against GPT-4 on domain-specific test set
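As a rough illustration of step 1, a curated preference pair couples one query with a preferred and a rejected retrieval. The field names and contents below are hypothetical, not the client's actual schema or data.

```python
# Hypothetical example of a curated preference pair; field names and contents are
# illustrative assumptions, not the actual training schema used in the project.
preference_pair = {
    "query": "Find the thread where legal signed off on the vendor contract",
    "chosen": {
        "email_ids": ["msg_48121", "msg_48133"],
        "answer": "Legal approved the vendor contract; see the thread between "
                  "the project lead and the General Counsel.",
    },
    "rejected": {
        "email_ids": ["msg_11872"],
        "answer": "A contract was discussed in an unrelated procurement thread.",
    },
}
```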
Model Specialisation
The fine-tuned model learned:
- Company-specific terminology and acronyms
- People names and organisational relationships
- Project codes and internal references
- Email threading and conversation context
- Date/time reasoning for “last month” or “Q3 2023” queries (a rough illustration follows this list)
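To make the last point concrete, here is a sketch of the kind of relative-date resolution the model had to internalise. It illustrates the target behaviour only; it is not code from the production system.

```python
# Illustration only: the kind of date-range resolution the fine-tuned model learned
# to perform implicitly for queries like "last month" or "Q3 2023".
from datetime import date, timedelta

def resolve_date_range(expression: str, today: date) -> tuple[date, date]:
    expression = expression.strip().lower()
    if expression == "last month":
        first_of_this_month = today.replace(day=1)
        last_month_end = first_of_this_month - timedelta(days=1)
        return last_month_end.replace(day=1), last_month_end
    if expression.startswith("q") and len(expression.split()) == 2:
        quarter, year = expression.split()
        q = int(quarter[1])
        start = date(int(year), 3 * (q - 1) + 1, 1)
        end = date(int(year) + (q == 4), (3 * q) % 12 + 1, 1) - timedelta(days=1)
        return start, end
    raise ValueError(f"Unrecognised date expression: {expression}")

print(resolve_date_range("Q3 2023", today=date(2024, 1, 15)))
# (datetime.date(2023, 7, 1), datetime.date(2023, 9, 30))
```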
Results
The fine-tuned model dramatically outperformed both generic alternatives:
Performance Comparison
| Metric | GPT-4 | GPT-3.5 | Our Fine-Tuned Model |
|---|---|---|---|
| Query Latency | 8-12 sec | 2-3 sec | 1.5-2 sec |
| Domain Retrieval Accuracy | 78% | 52% | 91% |
| Error Rate | 22% | 48% | Under 9% |
| Cost per 1K queries | $15+ | $0.50 | $0.10 |
Key Achievements
- 5x Faster Than GPT-4: Sub-2-second response times vs 8-12 seconds
- 60%+ Error Rate Reduction: From 22% errors (GPT-4) to under 9%
- Beat GPT-4 on Domain Retrieval: 91% accuracy vs 78% for GPT-4
- 150x Cost Reduction: $0.10 per 1K queries vs $15+ for GPT-4
Business Impact
- Executives now get instant answers from email archives
- Search abandonment rates dropped by 70%
- Monthly API costs reduced from $45K to $300
- Model runs on company infrastructure—no data leaves premises
Technical Details
RL Training Configuration
```python
# GRPO training configuration
training_config = {
    "base_model": "Qwen/Qwen2.5-14B",
    "method": "grpo",
    "reward_model": "custom_email_retrieval_rm",
    "lora_config": {
        "r": 64,
        "lora_alpha": 128,
        "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],
        "lora_dropout": 0.05
    },
    "training_args": {
        "learning_rate": 1e-5,
        "batch_size": 4,
        "gradient_accumulation_steps": 8,
        "num_epochs": 3,
        "warmup_ratio": 0.1
    }
}
```
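The project itself ran on OpenPipe ART with a custom email-retrieval reward model. Purely as an illustration of how a configuration like this maps onto an open-source GRPO implementation, the sketch below wires equivalent settings into TRL's GRPOTrainer with a simplified placeholder reward; the dataset file and its columns are assumptions.

```python
# Illustrative only: wiring the configuration above into TRL's GRPOTrainer.
# The real project used OpenPipe ART and a learned reward model; the reward
# function, dataset file, and column names here are simplified stand-ins.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def retrieval_reward(completions, reference_email_ids, **kwargs):
    """Placeholder reward: fraction of reference email IDs mentioned in each completion."""
    rewards = []
    for completion, refs in zip(completions, reference_email_ids):
        hits = sum(ref in completion for ref in refs)
        rewards.append(hits / max(len(refs), 1))
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-14B",
    reward_funcs=retrieval_reward,
    args=GRPOConfig(
        output_dir="email-search-grpo",
        learning_rate=1e-5,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        warmup_ratio=0.1,
    ),
    # Assumed JSONL with "prompt" and "reference_email_ids" columns.
    train_dataset=load_dataset("json", data_files="email_queries.jsonl")["train"],
    peft_config=LoraConfig(r=64, lora_alpha=128, lora_dropout=0.05,
                           target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]),
)
trainer.train()
```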
Why GRPO Over PPO
We chose GRPO (Group Relative Policy Optimization) over traditional PPO for several reasons (the group-relative advantage step at its core is sketched after this list):
- Stability: GRPO provides more stable training on smaller datasets
- Sample Efficiency: Requires fewer examples to achieve strong results
- Compute Efficiency: Lower memory footprint than full PPO
- Quality: Better alignment with human preferences on retrieval tasks
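The "group relative" part is what removes the need for a learned value critic: for each prompt, a group of candidate completions is sampled and each completion's advantage is its reward standardised against the rest of the group. A minimal sketch of that computation (the full algorithm also includes the clipped policy objective and KL regularisation):

```python
# Minimal sketch of GRPO's group-relative advantage: each sampled completion for a
# prompt is scored, and its advantage is the reward standardised within that group.
import numpy as np

def group_relative_advantages(group_rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    rewards = np.asarray(group_rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four candidate answers to the same email query, scored by the reward model:
print(group_relative_advantages([0.9, 0.4, 0.4, 0.1]))
# Completions above the group mean get a positive advantage, below it a negative one.
```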
Evaluation Framework
Continuous evaluation throughout training covered the following checks (a simplified harness is sketched after this list):
- Retrieval Accuracy: Does the model find the right emails?
- Answer Quality: Are answers factually correct and complete?
- Latency: Response time under 2 seconds?
- Coherence: Are responses well-formed and professional?
- Safety: No hallucinated emails or fabricated content?
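A hedged sketch of what such a continuous evaluation loop can look like; the query function, test-set format, and thresholds below are assumptions for illustration, not the client's actual benchmarking harness.

```python
# Illustrative evaluation loop: the query function, test-set format, and latency
# budget are assumptions, not the production benchmarking harness.
import time

LATENCY_BUDGET_SECONDS = 2.0

def evaluate(model_query_fn, test_cases):
    """test_cases: list of dicts with 'query' and 'expected_email_ids'."""
    correct_retrievals, within_budget = 0, 0
    for case in test_cases:
        start = time.perf_counter()
        result = model_query_fn(case["query"])  # assumed to return {'email_ids': [...], 'answer': str}
        latency = time.perf_counter() - start

        if set(case["expected_email_ids"]) <= set(result["email_ids"]):
            correct_retrievals += 1
        if latency <= LATENCY_BUDGET_SECONDS:
            within_budget += 1

    n = len(test_cases)
    # Answer quality, coherence, and safety checks are typically scored separately
    # (e.g. by reviewers or a judge model) and are omitted from this sketch.
    return {
        "retrieval_accuracy": correct_retrievals / n,
        "latency_within_budget": within_budget / n,
    }
```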
Infrastructure
- Training: 4x A100 80GB GPUs, 2 weeks training time
- Inference: Single A100 40GB for production serving
- Deployment: On-premise, air-gapped environment
- Integration: REST API compatible with the existing email search UI (a client sketch follows this list)
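For a sense of the integration surface, here is a minimal client sketch against a hypothetical on-premise endpoint. The URL, route, and payload fields are assumptions; the actual API follows the contract of the client's existing email search UI.

```python
# Hypothetical client call to the on-premise inference API; the URL, route,
# and payload fields are illustrative assumptions.
import requests

response = requests.post(
    "https://email-search.internal.example/api/v1/query",
    json={
        "query": "Summarise the vendor contract emails from Q3 2023",
        "max_results": 5,
    },
    timeout=5,
)
response.raise_for_status()
print(response.json())
```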
Key Insights
Why RL Fine-Tuning?
Traditional supervised fine-tuning (SFT) wasn’t sufficient because:
- Retrieval is Nuanced: “Correct” retrieval isn’t binary; some results are better than others (a toy graded reward is sketched after this list)
- Preference Learning: RL captures the subtle preferences humans have for retrieval quality
- Exploration: RL allows the model to discover better retrieval strategies
- Alignment: GRPO specifically optimises for the outcomes users care about
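To make the first point concrete, here is a toy sketch of a graded (non-binary) retrieval score of the kind a reward model learns to approximate. The weighting and penalty are illustrative assumptions, not the project's learned reward model.

```python
# Toy illustration of a graded retrieval reward: partial credit for overlapping
# emails, a small penalty for irrelevant ones. Weights are assumptions.
def graded_retrieval_reward(retrieved_ids, relevant_ids, irrelevant_penalty=0.2):
    if not relevant_ids:
        return 0.0
    recall = len(set(retrieved_ids) & set(relevant_ids)) / len(set(relevant_ids))
    noise = len(set(retrieved_ids) - set(relevant_ids)) * irrelevant_penalty
    return max(recall - noise, 0.0)

print(graded_retrieval_reward(["a", "b", "x"], ["a", "b", "c"]))  # ~0.47: partial credit
```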
Lessons Learned
- Data Quality > Quantity: 5,000 high-quality preference pairs outperformed 50,000 noisy examples
- Domain Expertise Matters: Our reward model needed to understand email-specific success criteria
- Evaluation is Critical: Continuous benchmarking caught regression early
- Start with Strong Base: Qwen 14B’s reasoning capabilities made fine-tuning more effective
Project Details
- Duration: 3 months from kickoff to production
- Team: 2 ML engineers
- Training Data: 5,000 curated preference pairs
- Model Size: 14B parameters (LoRA adapters ~500MB)
- Deployment: On-premise, single GPU inference
Want to fine-tune models for your domain-specific use case? Contact us to explore how RL fine-tuning can give you GPT-4 quality at a fraction of the cost.