
Modular RAG Document Q&A: 90%+ Fewer Vector DB Writes

Challenge

Internal knowledge was scattered across 8,000+ documents, and every document change triggered full re-indexing of the corpus, causing slow updates and high compute costs.

Solution

Implemented smart chunk-level upsert with deterministic chunk IDs and SHA-256 hashing for incremental updates, deployed 100% on-premise.

Results

90%+ fewer vector DB writes
100% on-premise deployment
Zero API costs
2,000 concurrent users supported

Challenge

A large enterprise with 8,000+ internal documents and 2,000 users faced critical knowledge management issues:

  • Scattered Knowledge: Internal documentation spread across multiple systems with no unified search
  • Full Re-indexing Pain: Every document change triggered complete re-indexing of the entire corpus
  • Performance Bottleneck: Re-indexing took hours, creating stale search results
  • High Infrastructure Load: Unnecessary vector database writes consuming compute resources
  • Data Sovereignty: Strict requirements for 100% on-premise deployment—no cloud APIs allowed

The existing solution couldn’t scale. Users were frustrated with outdated search results, and IT teams were overwhelmed managing the re-indexing workload.

Solution

We designed and implemented a modular RAG architecture with intelligent incremental updates:

Smart Chunk-Level Upsert System

The breakthrough innovation was our approach to document updates:

  1. Deterministic Chunk IDs: Each chunk receives a predictable ID based on document path and position
  2. SHA-256 Content Hashing: Every chunk’s content is hashed to detect actual changes
  3. Incremental Updates: Only modified chunks are updated in the vector database
  4. Orphan Cleanup: Deleted content is automatically removed from the index

Architecture Components

Document Processing Pipeline:

  • Docling for intelligent document parsing (PDFs, Word, HTML, etc.)
  • Preserves document structure, tables, and formatting
  • Handles 20+ document formats consistently
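As a rough illustration, a minimal Docling parsing sketch; the file path is hypothetical and the exact export call may vary by Docling version:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("handbook/expense-policy.pdf")  # hypothetical path

# Export to Markdown so structure (headings, tables) survives into chunking.
markdown_text = result.document.export_to_markdown()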

Embedding & Retrieval:

  • BAAI bge-m3 multilingual embeddings (local deployment)
  • Qdrant vector database for high-performance similarity search
  • Hybrid search combining semantic and keyword matching
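To illustrate the dense half of the retrieval path, here is a sketch assuming the FlagEmbedding package for bge-m3 and the qdrant-client search API; the collection name and payload fields are illustrative, and the keyword (sparse) side of the hybrid search is omitted for brevity:

from FlagEmbedding import BGEM3FlagModel
from qdrant_client import QdrantClient

embedder = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)  # local GPU deployment
client = QdrantClient(url="http://localhost:6333")        # self-hosted Qdrant

query = "How do I request VPN access?"
dense_vec = embedder.encode([query], return_dense=True)["dense_vecs"][0]

hits = client.search(
    collection_name="enterprise_docs",  # hypothetical collection name
    query_vector=dense_vec.tolist(),
    limit=5,
)
for hit in hits:
    print(hit.score, hit.payload.get("doc_path"))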

LLM Layer:

  • Qwen2.5:7B running fully on-premise
  • Optimised for the enterprise’s hardware infrastructure
  • Zero external API calls—complete data sovereignty

Orchestration:

  • LlamaIndex for RAG pipeline management
  • Custom indexing logic for incremental updates
  • Query routing for optimal retrieval strategies
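A minimal wiring sketch of how these pieces might fit together in LlamaIndex, assuming Qwen2.5:7B is served locally via Ollama and bge-m3 runs as a local HuggingFace embedding model; the URLs, collection name, and sample query are illustrative:

from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Local models only: no external API calls leave the cluster.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")
Settings.llm = Ollama(model="qwen2.5:7b", base_url="http://localhost:11434")

vector_store = QdrantVectorStore(
    client=QdrantClient(url="http://localhost:6333"),
    collection_name="enterprise_docs",  # hypothetical collection name
)
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine(similarity_top_k=5)

print(query_engine.query("What is the travel expense approval process?"))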

Results

The new system transformed document search and knowledge access:

Performance Improvements

Metric                          Before           After                    Improvement
Vector DB writes per update     100% of corpus   Under 10% (changed only) 90%+ reduction
Index update time               4+ hours         Minutes                  ~95% faster
Query latency                   Variable         Under 2 seconds          Consistent performance
Concurrent users                200 max          2,000+                   10x capacity

Operational Benefits

  • Zero API Costs: Complete on-premise deployment eliminates ongoing API expenses
  • Data Sovereignty: All data stays within enterprise infrastructure
  • Reduced Maintenance: Incremental updates mean less system strain
  • Scalable Architecture: Modular design allows easy capacity expansion

User Impact

  • Unified search across all 8,000 documents
  • Always up-to-date results (minutes, not hours)
  • Natural language Q&A on internal knowledge base
  • 2,000 users accessing simultaneously without degradation

Technical Details

Chunk Hashing Algorithm

The key innovation enabling incremental updates:

import hashlib

def generate_chunk_id(doc_path: str, chunk_index: int) -> str:
    """Deterministic chunk ID for consistent updates."""
    return f"{doc_path}::chunk_{chunk_index}"

def hash_chunk_content(content: str) -> str:
    """SHA-256 hash to detect content changes."""
    return hashlib.sha256(content.encode()).hexdigest()

def needs_update(chunk_id: str, new_hash: str, existing_hashes: dict) -> bool:
    """Only update if content actually changed."""
    return existing_hashes.get(chunk_id) != new_hash

Update Logic

  1. Parse document → extract chunks
  2. Generate deterministic chunk IDs
  3. Hash each chunk’s content
  4. Compare hashes with stored values
  5. Upsert only changed chunks
  6. Delete orphaned chunks (from removed content)
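
A sketch of how these six steps might map onto Qdrant's upsert and delete calls, reusing the helper functions above. The collection name, payload fields, and embed callable are assumptions, and the deterministic string IDs are converted to UUIDs because Qdrant point IDs must be integers or UUIDs:

import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, PointIdsList

client = QdrantClient(url="http://localhost:6333")
COLLECTION = "enterprise_docs"  # hypothetical collection name

def to_point_id(chunk_id: str) -> str:
    """Map the deterministic string ID to a deterministic UUID5."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, chunk_id))

def upsert_document(doc_path: str, chunks: list[str],
                    existing_hashes: dict[str, str], embed) -> None:
    new_hashes: dict[str, str] = {}
    points: list[PointStruct] = []

    for i, content in enumerate(chunks):
        chunk_id = generate_chunk_id(doc_path, i)
        content_hash = hash_chunk_content(content)
        new_hashes[chunk_id] = content_hash
        if needs_update(chunk_id, content_hash, existing_hashes):
            points.append(PointStruct(
                id=to_point_id(chunk_id),
                vector=embed(content),  # e.g. a bge-m3 dense vector
                payload={"doc_path": doc_path, "chunk_id": chunk_id,
                         "hash": content_hash, "text": content},
            ))

    # Upsert only the chunks whose content actually changed.
    if points:
        client.upsert(collection_name=COLLECTION, points=points)

    # Orphan cleanup: chunks that existed before but are absent from the new parse.
    orphans = [cid for cid in existing_hashes
               if cid.startswith(f"{doc_path}::") and cid not in new_hashes]
    if orphans:
        client.delete(
            collection_name=COLLECTION,
            points_selector=PointIdsList(points=[to_point_id(c) for c in orphans]),
        )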

Infrastructure Stack

  • Vector DB: Qdrant (self-hosted, clustered for HA)
  • Embeddings: BAAI bge-m3 (GPU-accelerated)
  • LLM: Qwen2.5:7B (optimised inference)
  • Orchestration: LlamaIndex with custom indexing
  • Document Processing: Docling pipeline
  • Deployment: Kubernetes on-premise

Key Design Decisions

  1. Deterministic IDs over UUIDs: Enables reliable chunk tracking across updates
  2. Content Hashing: Prevents unnecessary writes when content hasn’t changed
  3. Modular Architecture: Each component can be upgraded independently
  4. Local-First: All models run on-premise for data sovereignty
  5. Hybrid Search: Combines semantic understanding with keyword precision

Project Details

  • Duration: 4 months from kickoff to production
  • Team: 5 engineers (2 ML, 2 backend, 1 infrastructure)
  • Documents Indexed: 8,000+ and growing
  • Users: 2,000 concurrent users
  • Deployment: 100% on-premise, zero cloud dependencies

Need to implement RAG for your enterprise documents? Contact us to discuss your knowledge management challenges.

Technologies Used

Docling Qdrant BAAI bge-m3 Qwen2.5:7B LlamaIndex

Timeline

4 months delivery

Ready to achieve similar results?

Let's discuss how we can help your business succeed with AI.