
Modular RAG Document Q&A: 90%+ Fewer Vector DB Writes

Challenge

Internal knowledge was scattered across 8,000+ documents, and every document change triggered full re-indexing of the corpus, causing slow updates and high compute costs.

Solution

Implemented smart chunk-level upsert with deterministic chunk IDs and SHA-256 hashing for incremental updates, deployed 100% on-premise.

Results

90%+ fewer vector DB writes
100% on-premise deployment
Zero API costs
2,000 concurrent users supported

Challenge

A large enterprise with 8,000+ internal documents and 2,000 users faced critical knowledge management issues:

  • Scattered Knowledge: Internal documentation spread across multiple systems with no unified search
  • Full Re-indexing Pain: Every document change triggered complete re-indexing of the entire corpus
  • Performance Bottleneck: Re-indexing took hours, creating stale search results
  • High Infrastructure Load: Unnecessary vector database writes consuming compute resources
  • Data Sovereignty: Strict requirements for 100% on-premise deployment—no cloud APIs allowed

The existing solution couldn’t scale. Users were frustrated with outdated search results, and IT teams were overwhelmed managing the re-indexing workload.

Solution

We designed and implemented a modular RAG architecture with intelligent incremental updates:

Smart Chunk-Level Upsert System

The breakthrough innovation was our approach to document updates:

  1. Deterministic Chunk IDs: Each chunk receives a predictable ID based on document path and position
  2. SHA-256 Content Hashing: Every chunk’s content is hashed to detect actual changes
  3. Incremental Updates: Only modified chunks are updated in the vector database
  4. Orphan Cleanup: Deleted content is automatically removed from the index

Architecture Components

Document Processing Pipeline:

  • Docling for intelligent document parsing (PDFs, Word, HTML, etc.)
  • Preserves document structure, tables, and formatting
  • Handles 20+ document formats consistently
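As a rough illustration, a minimal Docling parsing sketch; the file path is hypothetical and the exact export call may vary by Docling version:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("handbook/expense-policy.pdf")  # hypothetical path

# Export to Markdown so structure (headings, tables) survives into chunking.
markdown_text = result.document.export_to_markdown()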

Embedding & Retrieval:

  • BAAI bge-m3 multilingual embeddings (local deployment)
  • Qdrant vector database for high-performance similarity search
  • Hybrid search combining semantic and keyword matching
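To illustrate the dense half of the retrieval path, here is a sketch assuming the FlagEmbedding package for bge-m3 and the qdrant-client search API; the collection name and payload fields are illustrative, and the keyword (sparse) side of the hybrid search is omitted for brevity:

from FlagEmbedding import BGEM3FlagModel
from qdrant_client import QdrantClient

embedder = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)  # local GPU deployment
client = QdrantClient(url="http://localhost:6333")        # self-hosted Qdrant

query = "How do I request VPN access?"
dense_vec = embedder.encode([query], return_dense=True)["dense_vecs"][0]

hits = client.search(
    collection_name="enterprise_docs",  # hypothetical collection name
    query_vector=dense_vec.tolist(),
    limit=5,
)
for hit in hits:
    print(hit.score, hit.payload.get("doc_path"))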

LLM Layer:

  • Qwen2.5:7B running fully on-premise
  • Optimised for the enterprise’s hardware infrastructure
  • Zero external API calls—complete data sovereignty

Orchestration:

  • LlamaIndex for RAG pipeline management
  • Custom indexing logic for incremental updates
  • Query routing for optimal retrieval strategies
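A minimal wiring sketch of how these pieces might fit together in LlamaIndex, assuming Qwen2.5:7B is served locally via Ollama and bge-m3 runs as a local HuggingFace embedding model; the URLs, collection name, and sample query are illustrative:

from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Local models only: no external API calls leave the cluster.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")
Settings.llm = Ollama(model="qwen2.5:7b", base_url="http://localhost:11434")

vector_store = QdrantVectorStore(
    client=QdrantClient(url="http://localhost:6333"),
    collection_name="enterprise_docs",  # hypothetical collection name
)
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine(similarity_top_k=5)

print(query_engine.query("What is the travel expense approval process?"))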

Results

The new system transformed document search and knowledge access:

Performance Improvements

Metric                          Before           After                    Improvement
Vector DB writes per update     100% of corpus   Under 10% (changed only) 90%+ reduction
Index update time               4+ hours         Minutes                  ~95% faster
Query latency                   Variable         Under 2 seconds          Consistent performance
Concurrent users                200 max          2,000+                   10x capacity

Operational Benefits

  • Zero API Costs: Complete on-premise deployment eliminates ongoing API expenses
  • Data Sovereignty: All data stays within enterprise infrastructure
  • Reduced Maintenance: Incremental updates mean less system strain
  • Scalable Architecture: Modular design allows easy capacity expansion

User Impact

  • Unified search across all 8,000 documents
  • Always up-to-date results (minutes, not hours)
  • Natural language Q&A on internal knowledge base
  • 2,000 users accessing simultaneously without degradation

Technical Details

Chunk Hashing Algorithm

The key innovation enabling incremental updates:

import hashlib

def generate_chunk_id(doc_path: str, chunk_index: int) -> str:
    """Deterministic chunk ID for consistent updates."""
    return f"{doc_path}::chunk_{chunk_index}"

def hash_chunk_content(content: str) -> str:
    """SHA-256 hash to detect content changes."""
    return hashlib.sha256(content.encode()).hexdigest()

def needs_update(chunk_id: str, new_hash: str, existing_hashes: dict) -> bool:
    """Only update if content actually changed."""
    return existing_hashes.get(chunk_id) != new_hash

Update Logic

  1. Parse document → extract chunks
  2. Generate deterministic chunk IDs
  3. Hash each chunk’s content
  4. Compare hashes with stored values
  5. Upsert only changed chunks
  6. Delete orphaned chunks (from removed content)
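
A sketch of how these six steps might map onto Qdrant's upsert and delete calls, reusing the helper functions above. The collection name, payload fields, and embed callable are assumptions, and the deterministic string IDs are converted to UUIDs because Qdrant point IDs must be integers or UUIDs:

import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, PointIdsList

client = QdrantClient(url="http://localhost:6333")
COLLECTION = "enterprise_docs"  # hypothetical collection name

def to_point_id(chunk_id: str) -> str:
    """Map the deterministic string ID to a deterministic UUID5."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, chunk_id))

def upsert_document(doc_path: str, chunks: list[str],
                    existing_hashes: dict[str, str], embed) -> None:
    new_hashes: dict[str, str] = {}
    points: list[PointStruct] = []

    for i, content in enumerate(chunks):
        chunk_id = generate_chunk_id(doc_path, i)
        content_hash = hash_chunk_content(content)
        new_hashes[chunk_id] = content_hash
        if needs_update(chunk_id, content_hash, existing_hashes):
            points.append(PointStruct(
                id=to_point_id(chunk_id),
                vector=embed(content),  # e.g. a bge-m3 dense vector
                payload={"doc_path": doc_path, "chunk_id": chunk_id,
                         "hash": content_hash, "text": content},
            ))

    # Upsert only the chunks whose content actually changed.
    if points:
        client.upsert(collection_name=COLLECTION, points=points)

    # Orphan cleanup: chunks that existed before but are absent from the new parse.
    orphans = [cid for cid in existing_hashes
               if cid.startswith(f"{doc_path}::") and cid not in new_hashes]
    if orphans:
        client.delete(
            collection_name=COLLECTION,
            points_selector=PointIdsList(points=[to_point_id(c) for c in orphans]),
        )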

Infrastructure Stack

  • Vector DB: Qdrant (self-hosted, clustered for HA)
  • Embeddings: BAAI bge-m3 (GPU-accelerated)
  • LLM: Qwen2.5:7B (optimised inference)
  • Orchestration: LlamaIndex with custom indexing
  • Document Processing: Docling pipeline
  • Deployment: Kubernetes on-premise

Key Design Decisions

  1. Deterministic IDs over UUIDs: Enables reliable chunk tracking across updates
  2. Content Hashing: Prevents unnecessary writes when content hasn’t changed
  3. Modular Architecture: Each component can be upgraded independently
  4. Local-First: All models run on-premise for data sovereignty
  5. Hybrid Search: Combines semantic understanding with keyword precision

Project Details

  • Duration: 4 months from kickoff to production
  • Team: 5 engineers (2 ML, 2 backend, 1 infrastructure)
  • Documents Indexed: 8,000+ and growing
  • Users: 2,000 concurrent users
  • Deployment: 100% on-premise, zero cloud dependencies

Need to implement RAG for your enterprise documents? Contact us to discuss your knowledge management challenges.

Technologies Used

Docling Qdrant BAAI bge-m3 Qwen2.5:7B LlamaIndex

Timeline

4 months delivery

Ready to achieve similar results?

Let's discuss how we can help your business succeed with AI.