I’m working with a massive document collection that includes around 30,000 PDF files. Many of these documents are quite lengthy, with some reaching up to 300 pages per file. All the content has been processed through OCR, so I’m dealing with plain text data.
My main challenge is with the ingestion process into the knowledge base. I attempted to use Open WebUI paired with an Ollama backend, but the performance was unacceptably slow for this volume of data.
Has anyone tackled a similar large-scale document processing project? I’d really appreciate hearing about your experience and what strategies worked best for handling this amount of data efficiently. What tools or approaches would you recommend for better performance?
I dealt with something similar - about 25,000 technical manuals, roughly 200 pages each. Game changer was ditching sequential processing for distributed processing. Set up Elasticsearch with a custom pipeline that batches documents across multiple nodes. Initial indexing took 3 days but queries are crazy fast now. For OCR text, clean up the common artifacts before ingestion - it’ll boost your search accuracy big time. Also, go with incremental updates instead of reprocessing everything when docs change. Saved me weeks during updates. The infrastructure cost upfront was totally worth the performance boost.
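To make the cleanup and incremental-update ideas concrete, here is a minimal Python sketch. It is not the Elasticsearch pipeline described above. The cleanup rules, the function names, and the in-memory `seen_hashes` store are all assumptions for illustration, and real OCR output will likely need rules tuned to your own documents.

```python
import hashlib
import re


def clean_ocr_text(text: str) -> str:
    """Remove a few common OCR artifacts (illustrative rules only)."""
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    # Caveat: this also joins genuine compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that contain nothing but a page number.
    text = re.sub(r"(?m)^[ \t]*\d+[ \t]*$\n?", "", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def docs_to_reindex(docs: dict[str, str], seen_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents that are new or whose cleaned text changed.

    `seen_hashes` is a hypothetical store mapping doc id -> last indexed hash;
    in practice it would live in a database rather than a dict.
    """
    changed = []
    for doc_id, text in docs.items():
        digest = content_hash(clean_ocr_text(text))
        if seen_hashes.get(doc_id) != digest:
            changed.append(doc_id)
            seen_hashes[doc_id] = digest
    return changed
```

Only the ids returned by `docs_to_reindex` would need to go back through the indexing step, which is the whole point of incremental updates.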
Been there with enterprise document workflows. The real bottleneck isn’t volume - it’s how you handle the entire pipeline.
You need smart automation doing the heavy lifting. Don’t throw everything at Ollama one document at a time. Use parallel processing with intelligent batching. Split those 30k PDFs into batches and automate every stage: extraction, cleaning, indexing, storage.
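A rough sketch of what batching plus parallel workers can look like in plain Python. `ingest_batch` is a hypothetical placeholder for whatever your extraction, cleaning, and indexing step actually does, and the batch size and worker count are made-up starting points, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def ingest_batch(paths: list[str]) -> int:
    """Hypothetical placeholder: read, clean, and index one batch of files.

    Returns the number of documents handled so the caller can track progress.
    """
    return len(paths)


def ingest_all(
    paths: list[str],
    batch_size: int = 100,
    workers: int = 4,
    handler: Callable[[list[str]], int] = ingest_batch,
) -> int:
    """Process every batch on a pool of worker threads and return the total count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(handler, batched(paths, batch_size)))
```

Threads suit steps that mostly wait on I/O, such as calls to an embedding or indexing service. If the per-document work is CPU-bound, swapping in `ProcessPoolExecutor` from the same module is the usual alternative.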
Build a workflow that watches performance and adjusts batch sizes on the fly. When part of your pipeline gets slammed, it should auto-scale or redistribute work.
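One simple way to adjust batch sizes on the fly is to time each batch and halve or double the next one. The sketch below assumes a target of 5 seconds per batch and size limits of 10 and 1000; those numbers and the function name are illustrative, not taken from any particular tool.

```python
def next_batch_size(
    current: int,
    seconds_taken: float,
    target_seconds: float = 5.0,
    min_size: int = 10,
    max_size: int = 1000,
) -> int:
    """Pick the next batch size from how long the last batch took."""
    if seconds_taken > target_seconds * 1.5:
        # The pipeline is struggling: back off.
        proposed = current // 2
    elif seconds_taken < target_seconds * 0.5:
        # Plenty of headroom: push more work per batch.
        proposed = current * 2
    else:
        # Close enough to the target: leave it alone.
        proposed = current
    # Keep the result inside the allowed range.
    return max(min_size, min(max_size, proposed))
```

Calling this after each batch, with the measured wall-clock time, gives a crude feedback loop without any external orchestration.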
I tackled something similar with automated workflows that processed docs in parallel streams, cleaned OCR mess in real-time, and pushed clean data to multiple knowledge bases at once. Went from weeks to hours.
At your scale, you need something handling complex document workflows without constant babysitting. Latenode does exactly this - manages your entire pipeline from PDF processing to knowledge base ingestion, with built-in scaling and monitoring.
Yeah, definitely try chunking your docs! Breaking them down into smaller bits can help a ton. Also, make sure your hardware can do parallel processing, not just sequential. Makes a big difference with that much data!
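For anyone wondering what chunking looks like in practice, here is a bare-bones sketch that splits text into fixed-size, overlapping character windows. The sizes are arbitrary assumptions, and many pipelines split on sentences or tokens instead of raw characters.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split `text` into chunks of up to `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk can then be embedded and indexed on its own, which keeps individual requests small and makes them easy to spread across workers.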