I’m working on a retrieval augmented generation project that involves processing around 30,000 PDF files. Many of these documents are quite large, with some containing up to 300 pages each. All the PDFs have been processed through OCR, so I’m dealing with plain text content.
My main challenge is the document ingestion phase into the knowledge base. I initially tried an Open WebUI setup with an Ollama backend, but performance was unacceptably slow for this volume of data.
I’m looking for recommendations on the most effective approaches for handling this scale of document processing. Has anyone successfully implemented a similar large-scale RAG system? What tools, frameworks, or strategies worked best for your use case? Any insights on optimizing the ingestion pipeline would be greatly appreciated.
Been there with similar document volumes - the ingestion bottleneck will absolutely kill your project if you don’t handle it right.
Break this into parallel chunks instead of processing everything sequentially. Split those 300-page docs into smaller sections (10-20 pages each), then run multiple processing pipelines at once.
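A minimal sketch of that split-then-parallelize step (the names here, like `process_section`, are placeholders for your actual cleaning/chunking/embedding work; a thread pool suits API-bound embedding calls, while CPU-bound parsing would want `ProcessPoolExecutor` instead):

```python
from concurrent.futures import ThreadPoolExecutor

def split_pages(pages, pages_per_section=15):
    """Split one long document's pages into 10-20 page sections."""
    return [pages[i:i + pages_per_section]
            for i in range(0, len(pages), pages_per_section)]

def process_section(section):
    """Placeholder for per-section work (clean, chunk, embed).
    Here it just counts characters so the sketch is runnable."""
    return sum(len(page) for page in section)

def process_document(pages, workers=8):
    """Process all sections of a document concurrently."""
    sections = split_pages(pages)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_section, sections))
```

A 300-page document becomes 20 independent sections, so a slow section never blocks the rest of the file.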
Don’t build this manually. Coordinating document splitting, embedding generation, vector storage, and error handling gets messy fast. You’ll debug pipeline issues more than actually using your RAG system.
I handled 25k documents last year using automation for the entire flow. Set up parallel streams processing ~500 documents simultaneously, with automatic retry for failed chunks and smart batching for vector database writes.
Automating this lets you monitor progress real-time, pause/resume processing, and scale based on your infrastructure. Plus you get proper error logging so when something breaks (it will), you know exactly which documents need reprocessing.
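The retry-and-batch behavior described above can be sketched in plain Python regardless of which orchestration tool you pick (the helper names, attempt counts, and delays are illustrative, not from any particular framework):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # surface the error so this document gets flagged for reprocessing
            time.sleep(base_delay * 2 ** attempt)

def batched(items, size=100):
    """Yield fixed-size batches for bulk vector-database writes."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

Wrapping each embedding call in `with_retries` and writing vectors with `batched` covers the two failure modes that dominate at this scale: transient API errors and death-by-a-thousand single-row inserts.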
Latenode makes complex document pipelines like this really straightforward to build and manage. Check it out at https://latenode.com
Had the same scalability nightmare building a RAG system for legal docs last year. The game changer was ditching sequential processing for a parallel setup with a proper queue system: Celery + Redis spread the work across multiple workers and cut processing time from weeks to about 2 days for ~25k documents.

Also stopped using fixed-size chunks and parsed by sections/paragraphs instead to keep the meaning intact. Switched from Chroma to Weaviate for the vector DB, which handles this scale much better.

Heads up on memory requirements: you'll need tons of RAM or you'll have to go with disk-based storage. Pro tip: test retrieval quality early, because what works for hundreds of docs can fall apart at this scale.
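The section/paragraph-based chunking mentioned above can be approximated in a few lines (the `max_chars` budget is an arbitrary example; real pipelines often budget by tokens rather than characters):

```python
def chunk_by_paragraphs(text, max_chars=1000):
    """Merge consecutive paragraphs into chunks up to max_chars,
    never splitting inside a paragraph. A single paragraph longer
    than max_chars still becomes its own (oversized) chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Because chunk boundaries fall on paragraph breaks, retrieved passages stay self-contained instead of starting or ending mid-sentence, which is usually what "fall apart at this scale" looks like in practice.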