Best practices for handling RAG systems with massive document collections?

Hey everyone,

I’m working on a retrieval-augmented generation project that needs to handle around 20 million documents. This is my first time dealing with such a large scale, and I’m running into some challenges.

I’m particularly struggling with three main areas. First is the response time - queries are taking way too long to process. Second is the vector embeddings - generating and storing embeddings for this many documents seems overwhelming. Third is the indexing strategy - I’m not sure what approach works best for this volume.

Has anyone here built RAG applications at this scale before? I’d really appreciate hearing about your experiences and what techniques you used to make it work efficiently. Any tips on architecture choices or tools that helped you would be awesome.

Thanks in advance!

Had a similar 20M project last year - learned some painful lessons. The game changer was hierarchical clustering before vector search: instead of searching all embeddings directly, we clustered docs into ~10k groups, found the relevant clusters first, then searched within those. Cut query times from 8+ seconds to under 2.

For embeddings, we batch-processed with GPU clusters and cached everything in distributed storage. Initial processing took a week, but updates are easy now. Try smaller embedding models if you can - we went from 1536 to 768 dimensions with barely any quality loss but huge storage/speed gains.

Don’t make our early mistake of keeping everything in memory. Disk-based indices with smart caching work way better at this scale.
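The cluster-then-search idea above can be sketched in a few lines. This is a toy illustration (scikit-learn KMeans over random stand-in embeddings, with exact distances inside the probed clusters) rather than the poster's actual setup; all names and sizes are made up for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
docs = rng.normal(size=(2000, 32)).astype(np.float32)  # stand-in embeddings

# Stage 1 (offline): cluster the corpus. The post used ~10k clusters for 20M
# docs; we use 50 clusters over 2k vectors just to show the mechanics.
n_clusters = 50
km = KMeans(n_clusters=n_clusters, n_init=5, random_state=0).fit(docs)
centroids = km.cluster_centers_
members = {c: np.flatnonzero(km.labels_ == c) for c in range(n_clusters)}

def two_stage_search(query, n_probe=5, top_k=10):
    # Stage 2a: cheap search over centroids to pick which clusters to probe.
    cdist = np.linalg.norm(centroids - query, axis=1)
    probe = np.argsort(cdist)[:n_probe]
    # Stage 2b: exact distance computation only within the probed clusters,
    # so most of the corpus is never touched per query.
    cand = np.concatenate([members[c] for c in probe])
    ddist = np.linalg.norm(docs[cand] - query, axis=1)
    return cand[np.argsort(ddist)[:top_k]]

hits = two_stage_search(docs[0])
```

This is essentially what IVF indices in libraries like Faiss do for you (with `n_probe` as the recall/latency knob), so at 20M docs you'd reach for one of those rather than hand-rolling it.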

the real bottleneck at 20m docs isn’t vector search - it’s keeping embeddings fresh when docs change. we switched to incremental updates instead of full reindexing every time. also try hybrid search - combine keyword matching with semantic search to filter candidates before running expensive vector ops.
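The hybrid idea (cheap keyword filter first, vector scoring only on survivors) can be sketched like this. Everything here is illustrative - a naive term match stands in for a real BM25 index, and the embeddings are random:

```python
import numpy as np

docs = [
    "postgres index tuning for large tables",
    "vector search with hnsw graphs",
    "baking sourdough bread at home",
    "hybrid search combining bm25 and embeddings",
]
rng = np.random.default_rng(1)
emb = rng.normal(size=(len(docs), 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit vectors for cosine

def hybrid_search(query_terms, query_vec, top_k=2):
    # Stage 1: keyword pre-filter narrows the candidate set cheaply.
    cand = [i for i, d in enumerate(docs) if set(query_terms) & set(d.split())]
    if not cand:
        cand = list(range(len(docs)))  # fall back to full semantic search
    # Stage 2: run the expensive vector scoring only on the survivors.
    scores = emb[cand] @ query_vec
    order = np.argsort(scores)[::-1][:top_k]
    return [cand[i] for i in order]

q = emb[3]  # pretend query embedding near doc 3
result = hybrid_search(["search", "embeddings"], q)
```

In production you'd swap the term match for an inverted index (Elasticsearch, Tantivy, Postgres full-text) and merge the two score lists, e.g. with reciprocal rank fusion.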

Scale like this needs a totally different approach. I stopped fighting vector search limitations and automated everything to work smarter.

I built a system that auto-partitions documents by content type and metadata first. Then it runs parallel embedding generation across multiple workers - but only for docs that actually get queried often. Cold documents sit in cheap storage until needed.

The game-changer was automating index selection. The system watches query patterns and automatically switches between dense and sparse indices based on what performs better for each document cluster. No more guessing games.
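The adaptive index selection described above could be sketched as a per-cluster scoreboard: record a quality signal for each index type after every query, then route to whichever currently wins. This is my own minimal illustration, not the poster's system; the class, names, and signals are hypothetical:

```python
from collections import defaultdict

class IndexSelector:
    """Per-cluster running score for each index type; route to the winner."""

    def __init__(self, kinds=("dense", "sparse")):
        self.kinds = kinds
        self.score = defaultdict(lambda: {k: 0.0 for k in kinds})
        self.count = defaultdict(lambda: {k: 0 for k in kinds})

    def record(self, cluster, kind, quality):
        # quality could be click-through, recall@k, or inverse latency.
        self.count[cluster][kind] += 1
        n = self.count[cluster][kind]
        # Incremental running mean, so no history needs to be stored.
        self.score[cluster][kind] += (quality - self.score[cluster][kind]) / n

    def pick(self, cluster):
        return max(self.kinds, key=lambda k: self.score[cluster][k])

sel = IndexSelector()
sel.record("legal-docs", "dense", 0.9)
sel.record("legal-docs", "sparse", 0.6)
choice = sel.pick("legal-docs")
```

A real version would add exploration (occasionally trying the losing index so its score stays current), which makes this a bandit problem.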

At 20M scale, you need automated data lifecycle management. Documents get promoted or demoted between hot, warm, and cold tiers based on how often they’re accessed. This keeps your active search space manageable while everything stays reachable.
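A hot/warm/cold lifecycle like that boils down to promotion on access and demotion on idleness. Here's a minimal sketch of that bookkeeping - the class, thresholds, and tier names are all assumptions for illustration, not the poster's implementation:

```python
import time
from collections import defaultdict

HOT_HITS = 5          # accesses before a doc is promoted to the hot tier
IDLE_SECONDS = 3600   # idle time before a doc is demoted to cold

class TieredStore:
    def __init__(self):
        self.tier = {}                 # doc_id -> "hot" | "warm" | "cold"
        self.hits = defaultdict(int)
        self.last_access = {}

    def add(self, doc_id):
        self.tier[doc_id] = "cold"     # new docs start in cheap storage

    def access(self, doc_id, now=None):
        if now is None:
            now = time.time()
        self.hits[doc_id] += 1
        self.last_access[doc_id] = now
        if self.hits[doc_id] >= HOT_HITS:
            self.tier[doc_id] = "hot"   # keep in the active search index
        elif self.tier[doc_id] == "cold":
            self.tier[doc_id] = "warm"  # candidate for indexing

    def demote_idle(self, now=None):
        if now is None:
            now = time.time()
        for doc_id, t in self.last_access.items():
            if now - t > IDLE_SECONDS:
                self.tier[doc_id] = "cold"  # drop from active search space
                self.hits[doc_id] = 0

store = TieredStore()
store.add("doc-1")
for _ in range(5):
    store.access("doc-1", now=0)
tier_when_hot = store.tier["doc-1"]
store.demote_idle(now=7200)
```

In a real pipeline, tier changes would trigger moving the doc's vectors between the in-memory index, a disk-based index, and object storage.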

Response times went from minutes to seconds once I stopped manually optimizing and let automation handle the complexity. The system learns and adapts way better than any static setup.

Latenode handles all the orchestration perfectly. I built the entire pipeline there and it manages worker scaling, data movement, and index switching automatically.

Twenty million documents? You need to completely rethink your storage setup. Everyone obsesses over vector search optimization, but the real problem is how you organize and grab data before it even gets to the embedding stage.

I built a three-tier system that made a huge difference. First, I preprocess documents into semantic chunks with basic metadata tags. This creates clean boundaries so you can dump irrelevant stuff early. The big realization was treating document retrieval and vector similarity as totally separate problems.

For embeddings, use approximate nearest neighbor algorithms like HNSW or IVF instead of exact search. Quality is basically the same but performance jumps way up. We also pre-filter using document metadata before doing vector comparisons - cut our search space by about 80% on most queries.

Infrastructure-wise, you’ve got to scale horizontally. Split your index across multiple machines and add query routing logic. Database sharding works great here - partition by document type, date ranges, or content domains depending on what you’re doing.

Here’s something most people miss: bigger embedding models don’t always give better results at scale. We switched to a smaller specialized model trained on our specific domain and got both faster speeds and better relevance scores.
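The metadata pre-filter step mentioned above looks roughly like this. A toy sketch with made-up metadata fields (`doc_type`, `year`) and brute-force distances standing in for the ANN index - at real scale you'd hand the filtered candidates to HNSW or IVF via Faiss or hnswlib:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 1000, 32
emb = rng.normal(size=(N, D)).astype(np.float32)
# Hypothetical metadata: each doc tagged with a type and a year.
doc_type = rng.choice(["pdf", "html", "email"], size=N)
year = rng.integers(2015, 2025, size=N)

def filtered_search(query, dtype, min_year, top_k=5):
    # Metadata pre-filter shrinks the search space before any vector math;
    # this is where the "~80% of queries" reduction comes from.
    mask = (doc_type == dtype) & (year >= min_year)
    cand = np.flatnonzero(mask)
    # Brute-force distance on the survivors; swap in an ANN index at scale.
    dist = np.linalg.norm(emb[cand] - query, axis=1)
    return cand[np.argsort(dist)[:top_k]]

hits = filtered_search(emb[0], doc_type[0], 2015)
```

One design caveat: filtering before ANN (pre-filtering) is exact but needs the index to support restricted searches; filtering after (post-filtering) is simpler but can return fewer than `top_k` results when the filter is strict.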