Challenge
A large enterprise with 8,000+ internal documents and 2,000 users faced critical knowledge management issues:
- Scattered Knowledge: Internal documentation spread across multiple systems with no unified search
- Full Re-indexing Pain: Every document change triggered complete re-indexing of the entire corpus
- Performance Bottleneck: Re-indexing took hours, creating stale search results
- High Infrastructure Load: Unnecessary vector database writes consuming compute resources
- Data Sovereignty: Strict requirements for 100% on-premise deployment—no cloud APIs allowed
The existing solution couldn’t scale. Users were frustrated with outdated search results, and IT teams were overwhelmed by the re-indexing workload.
Solution
We designed and implemented a modular RAG architecture with intelligent incremental updates:
Smart Chunk-Level Upsert System
The breakthrough innovation was our approach to document updates:
- Deterministic Chunk IDs: Each chunk receives a predictable ID based on document path and position
- SHA-256 Content Hashing: Every chunk’s content is hashed to detect actual changes
- Incremental Updates: Only modified chunks are updated in the vector database
- Orphan Cleanup: Deleted content is automatically removed from the index
Architecture Components
Document Processing Pipeline:
- Docling for intelligent document parsing (PDFs, Word, HTML, etc.)
- Preserves document structure, tables, and formatting
- Handles 20+ document formats consistently
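As a minimal sketch of the parsing step (the file path is a placeholder, not a detail from the project), Docling's DocumentConverter turns a source file into structured text that downstream chunking can rely on:

```python
from docling.document_converter import DocumentConverter

# One converter handles PDFs, Word documents, HTML, and other formats.
converter = DocumentConverter()

# Placeholder path for illustration.
result = converter.convert("security-handbook.pdf")

# Markdown export keeps headings, tables, and structure intact, which
# keeps chunk boundaries meaningful for retrieval.
markdown_text = result.document.export_to_markdown()
```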
Embedding & Retrieval:
- BAAI bge-m3 multilingual embeddings (local deployment)
- Qdrant vector database for high-performance similarity search
- Hybrid search combining semantic and keyword matching
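The sketch below shows one way these pieces could be wired together with LlamaIndex; the Qdrant URL and collection name are illustrative assumptions:

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# bge-m3 runs locally; no embedding calls leave the network.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")

# Self-hosted Qdrant instance (URL and collection name are placeholders).
client = QdrantClient(url="http://qdrant.internal:6333")
vector_store = QdrantVectorStore(
    client=client,
    collection_name="enterprise_docs",
    enable_hybrid=True,  # dense semantic search plus sparse keyword-style matching
)
```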
LLM Layer:
- Qwen2.5:7B running fully on-premise
- Optimised for the enterprise’s hardware infrastructure
- Zero external API calls—complete data sovereignty
Orchestration:
- LlamaIndex for RAG pipeline management
- Custom indexing logic for incremental updates
- Query routing for optimal retrieval strategies
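Continuing that sketch, the orchestration layer might assemble the index and query engine roughly as follows. Serving Qwen2.5:7B through Ollama and the example question are assumptions for illustration:

```python
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.ollama import Ollama

# Local LLM endpoint; no external API calls.
Settings.llm = Ollama(model="qwen2.5:7b", request_timeout=120.0)
Settings.embed_model = embed_model  # bge-m3 from the previous sketch

# Attach the index to the existing Qdrant collection.
index = VectorStoreIndex.from_vector_store(vector_store)

# Hybrid retrieval: semantic similarity plus keyword precision.
query_engine = index.as_query_engine(
    vector_store_query_mode="hybrid",
    similarity_top_k=5,
)
response = query_engine.query("What is the retention policy for contractor records?")
```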
Results
The new system transformed document search and knowledge access:
Performance Improvements
| Metric | Before | After | Improvement |
|---|---|---|---|
| Vector DB writes per update | 100% of corpus | <10% (changed chunks only) | 90%+ reduction |
| Index update time | 4+ hours | Minutes | ~95% faster |
| Query latency | Variable | Under 2 seconds | Consistent performance |
| Concurrent users | 200 max | 2,000+ | 10x capacity |
Operational Benefits
- Zero API Costs: Complete on-premise deployment eliminates ongoing API expenses
- Data Sovereignty: All data stays within enterprise infrastructure
- Reduced Maintenance: Incremental updates mean less system strain
- Scalable Architecture: Modular design allows easy capacity expansion
User Impact
- Unified search across all 8,000+ documents
- Always up-to-date results (minutes, not hours)
- Natural language Q&A on internal knowledge base
- 2,000 users accessing simultaneously without degradation
Technical Details
Chunk Hashing Algorithm
The key innovation enabling incremental updates:
```python
import hashlib


def generate_chunk_id(doc_path: str, chunk_index: int) -> str:
    """Deterministic chunk ID for consistent updates."""
    return f"{doc_path}::chunk_{chunk_index}"


def hash_chunk_content(content: str) -> str:
    """SHA-256 hash to detect content changes."""
    return hashlib.sha256(content.encode()).hexdigest()


def needs_update(chunk_id: str, new_hash: str, existing_hashes: dict) -> bool:
    """Only update if content actually changed."""
    return existing_hashes.get(chunk_id) != new_hash
```
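For illustration, the helpers might be applied to a freshly parsed document like this; the document path, chunk texts, and stored-hash lookup are hypothetical:

```python
# Hypothetical inputs: chunks from the parsing pipeline and the hashes
# already stored alongside each vector in the database.
doc_path = "policies/security-handbook.pdf"
chunks = ["Chunk one text...", "Chunk two text..."]
stored_hashes = {"policies/security-handbook.pdf::chunk_0": "e3b0c4..."}

changed = []
for i, text in enumerate(chunks):
    chunk_id = generate_chunk_id(doc_path, i)
    new_hash = hash_chunk_content(text)
    if needs_update(chunk_id, new_hash, stored_hashes):
        changed.append((chunk_id, new_hash, text))
# Only the entries in `changed` are re-embedded and written back.
```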
Update Logic
1. Parse the document and extract chunks
2. Generate deterministic chunk IDs
3. Hash each chunk's content
4. Compare hashes with the stored values
5. Upsert only the changed chunks
6. Delete orphaned chunks (content removed from the source document)
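A rough sketch of the last two steps against Qdrant follows; the client URL, collection name, and payload fields are assumptions, not details from the project. Because Qdrant point IDs must be UUIDs or unsigned integers, the deterministic string ID is mapped to a stable UUID before writing:

```python
import uuid

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://qdrant.internal:6333")  # assumed self-hosted endpoint
COLLECTION = "enterprise_docs"  # assumed collection name


def to_point_id(chunk_id: str) -> str:
    # Qdrant point IDs must be UUIDs or unsigned integers, so the
    # deterministic string ID is mapped to a stable UUID.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, chunk_id))


def upsert_changed_chunks(changed: list[tuple[str, str, list[float], str]]) -> None:
    # `changed` holds (chunk_id, content_hash, embedding, text) tuples for
    # chunks whose hash differs from the stored value.
    points = [
        models.PointStruct(
            id=to_point_id(chunk_id),
            vector=embedding,
            payload={
                "chunk_id": chunk_id,
                "doc_path": chunk_id.split("::")[0],
                "content_hash": content_hash,
                "text": text,
            },
        )
        for chunk_id, content_hash, embedding, text in changed
    ]
    client.upsert(collection_name=COLLECTION, points=points)


def delete_orphans(doc_path: str, live_chunk_ids: set[str]) -> None:
    # Remove points belonging to this document whose chunk_id is no longer
    # produced by the latest parse (i.e. content deleted from the source).
    client.delete(
        collection_name=COLLECTION,
        points_selector=models.FilterSelector(
            filter=models.Filter(
                must=[
                    models.FieldCondition(
                        key="doc_path", match=models.MatchValue(value=doc_path)
                    )
                ],
                must_not=[
                    models.FieldCondition(
                        key="chunk_id", match=models.MatchAny(any=list(live_chunk_ids))
                    )
                ],
            )
        ),
    )
```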
Infrastructure Stack
- Vector DB: Qdrant (self-hosted, clustered for HA)
- Embeddings: BAAI bge-m3 (GPU-accelerated)
- LLM: Qwen2.5:7B (optimised inference)
- Orchestration: LlamaIndex with custom indexing
- Document Processing: Docling pipeline
- Deployment: Kubernetes on-premise
Key Design Decisions
- Deterministic IDs over UUIDs: Enables reliable chunk tracking across updates
- Content Hashing: Prevents unnecessary writes when content hasn’t changed
- Modular Architecture: Each component can be upgraded independently
- Local-First: All models run on-premise for data sovereignty
- Hybrid Search: Combines semantic understanding with keyword precision
Project Details
- Duration: 4 months from kickoff to production
- Team: 5 engineers (2 ML, 2 backend, 1 infrastructure)
- Documents Indexed: 8,000+ and growing
- Users: 2,000 concurrent
- Deployment: 100% on-premise, zero cloud dependencies
Need to implement RAG for your enterprise documents? Contact us to discuss your knowledge management challenges.