Is there a way to enable multithreading or multiprocessing when using the Chroma.from_documents() method in LangChain?
I’m working with a dataset of about 1000 documents that need to be embedded with the mpnet model running on a GPU. Even on my high-end hardware, the embedding process takes an unreasonable amount of time, so I’m looking for ways to use parallel processing to speed up the document embedding workflow.
had the same issue with my 3090. chroma’s from_documents sucks for batch processing. i switched to sentence-transformers directly - use encode() with batch_size set to 32 or 64, then feed those embeddings into chroma separately. way faster than letting chroma handle embeddings.
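A sketch of that switch, with assumptions called out: the model name (`sentence-transformers/all-mpnet-base-v2`) and the `encode`-wrapping lambda are illustrative, and note that `encode()` already batches internally via its `batch_size` argument. The helper below just keeps the batching explicit so results can be inserted into Chroma chunk by chunk:

```python
def embed_in_batches(texts, encode, batch_size=64):
    """Call encode() on consecutive slices of texts and concatenate results.

    encode: any callable mapping a list of strings to a list of vectors,
    e.g. lambda b: model.encode(b).tolist() for a SentenceTransformer.
    """
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(encode(texts[i:i + batch_size]))
    return vectors

# Real usage (assumes sentence-transformers is installed and a GPU is free):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2",
#                               device="cuda")
#   vectors = embed_in_batches(docs, lambda b: model.encode(b).tolist(), 64)
# The resulting vectors can then be added to a chromadb collection directly,
# bypassing Chroma.from_documents() entirely.
```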
Multiprocessing works great here if you set it up right. I run similar workloads on my 4080: split those 1000 docs into smaller chunks and process them across multiple processes, and you’ll get a solid speedup.

Make separate processes that each take 100-200 docs. Each one loads its own model and works independently. Just watch GPU memory, since they’ll all compete for VRAM. Your 4090 has 24GB, so you can easily run 3-4 processes at once. Use multiprocessing.Pool with maxtasksperchild=1 to avoid memory leaks.

Let each chunk finish completely, then merge everything into one Chroma collection. This beats the sequential bottleneck people keep talking about and still keeps your LangChain workflow intact.
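A minimal sketch of that layout, under the chunk sizes and worker count suggested above. The worker is stubbed to return fixed-size vectors so the sketch runs without a GPU; the commented lines show where the per-process model load would go:

```python
from multiprocessing import Pool

def chunk(items, size):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_chunk(texts):
    # In the real pipeline each process loads its own model here, e.g.:
    #   from sentence_transformers import SentenceTransformer
    #   model = SentenceTransformer("all-mpnet-base-v2", device="cuda")
    #   return model.encode(texts, batch_size=32).tolist()
    # Stub: fixed-size vectors so the sketch runs anywhere.
    return [[0.0] * 768 for _ in texts]

if __name__ == "__main__":
    docs = [f"document {i}" for i in range(1000)]
    chunks = chunk(docs, 200)  # 5 chunks of 200 docs each
    # maxtasksperchild=1 tears each worker down after one chunk, so any
    # leaked GPU memory is reclaimed before the next chunk starts.
    with Pool(processes=4, maxtasksperchild=1) as pool:
        parts = pool.map(embed_chunk, chunks)
    embeddings = [vec for part in parts for vec in part]
    # embeddings can now be merged into one Chroma collection.
```

Keeping the worker at module level matters: `Pool` pickles the function by reference, so nested or lambda workers would fail.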
Indeed, the bottleneck with Chroma.from_documents() arises because it computes embeddings sequentially rather than fully utilizing your GPU, and the method does not support parallel processing. A better approach is to pre-compute the embeddings in batches. I faced similar challenges with large datasets and found that switching to HuggingFace’s sentence-transformers and batching the embeddings significantly improved GPU utilization. Once you have computed embeddings for document chunks in parallel, you can supply them to Chroma directly through the underlying chromadb collection’s add() method (note that LangChain’s from_texts() expects an embedding function and will re-embed the texts itself). Alternatively, you might use threading.ThreadPoolExecutor for concurrent chunk processing, though be cautious about memory limits given your dataset size. The key takeaway is to avoid the default sequential processing in from_documents().
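A hedged sketch of the ThreadPoolExecutor variant: threads can help here because PyTorch releases the GIL while GPU kernels run, though actual gains depend on how much time is spent outside those kernels. The `encode` callable is a stand-in; in practice it would wrap a single SentenceTransformer shared across threads:

```python
from concurrent.futures import ThreadPoolExecutor

def embed_concurrently(chunks, encode, max_workers=4):
    """Encode chunks of texts on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # executor.map preserves the order of the input chunks
        parts = list(pool.map(encode, chunks))
    return [vec for part in parts for vec in part]

# Real usage would pass something like:
#   encode = lambda batch: model.encode(batch, batch_size=32).tolist()
# where model is one SentenceTransformer shared by all threads -- a single
# model keeps VRAM use bounded, unlike the one-model-per-process route.
```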
Everyone’s suggesting manual workarounds but you’re missing the bigger picture. This whole embedding workflow needs proper automation.
I hit this exact bottleneck processing thousands of legal documents monthly. Instead of fighting Chroma’s limitations or writing custom multiprocessing code, I built an automated pipeline that handles everything.
It’s not just about batching embeddings. You need orchestration that monitors GPU usage, splits workloads automatically, handles failures, and scales based on your hardware. Your 4090 can definitely handle multiple concurrent embedding jobs if managed right.
I created workflows that detect document types, route them to appropriate embedding models, batch them optimally for your specific GPU memory, and rebuild collections seamlessly. No manual batch size tuning or memory management headaches.
The key is having a system that adapts to your hardware automatically. GPU memory gets tight? It scales down batch sizes. Processing finishes? It immediately starts the next chunk. Zero manual intervention.
This approach cut our processing time from 6 hours to about 45 minutes for similar document volumes. Plus it handles errors gracefully and gives you real monitoring.
Check out how to build these automated embedding pipelines: https://latenode.com