Is there a way to enable multithreading or multiprocessing when using the Chroma.from_documents() method in LangChain?
I’m working with a dataset of about 1000 documents that need to be embedded with the mpnet model running on a GPU. Even on my high-end hardware, the embedding process takes an unreasonable amount of time, so I’m looking for ways to use parallel processing to speed up the document embedding workflow.
had the same issue with my 3090. chroma’s from_documents sucks for batch processing. i switched to sentence-transformers directly - use encode() with batch_size set to 32 or 64, then feed those embeddings into chroma separately. way faster than letting chroma handle embeddings.
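A sketch of that switch, with assumptions called out: the model name (`sentence-transformers/all-mpnet-base-v2`) and the `encode`-wrapping lambda are illustrative, and note that `encode()` already batches internally via its `batch_size` argument. The helper below just keeps the batching explicit so results can be inserted into Chroma chunk by chunk:

```python
def embed_in_batches(texts, encode, batch_size=64):
    """Call encode() on consecutive slices of texts and concatenate results.

    encode: any callable mapping a list of strings to a list of vectors,
    e.g. lambda b: model.encode(b).tolist() for a SentenceTransformer.
    """
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(encode(texts[i:i + batch_size]))
    return vectors

# Real usage (assumes sentence-transformers is installed and a GPU is free):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2",
#                               device="cuda")
#   vectors = embed_in_batches(docs, lambda b: model.encode(b).tolist(), 64)
# The resulting vectors can then be added to a chromadb collection directly,
# bypassing Chroma.from_documents() entirely.
```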
Multiprocessing works great here if you set it up right. I run similar workloads on my 4080: split those 1000 docs into smaller chunks and process them across multiple processes, and you’ll get a solid speedup.

Make separate processes that each take 100-200 docs. Each one loads its own model and works independently. Just watch GPU memory, since they’ll all compete for VRAM. Your 4090 has 24GB, so you can easily run 3-4 processes at once. Use multiprocessing.Pool with maxtasksperchild=1 to avoid memory leaks.

Let each chunk finish completely, then merge everything into one Chroma collection. This beats the sequential bottleneck people keep talking about and still keeps your LangChain workflow intact.
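A minimal sketch of that layout, under the chunk sizes and worker count suggested above. The worker is stubbed to return fixed-size vectors so the sketch runs without a GPU; the commented lines show where the per-process model load would go:

```python
from multiprocessing import Pool

def chunk(items, size):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_chunk(texts):
    # In the real pipeline each process loads its own model here, e.g.:
    #   from sentence_transformers import SentenceTransformer
    #   model = SentenceTransformer("all-mpnet-base-v2", device="cuda")
    #   return model.encode(texts, batch_size=32).tolist()
    # Stub: fixed-size vectors so the sketch runs anywhere.
    return [[0.0] * 768 for _ in texts]

if __name__ == "__main__":
    docs = [f"document {i}" for i in range(1000)]
    chunks = chunk(docs, 200)  # 5 chunks of 200 docs each
    # maxtasksperchild=1 tears each worker down after one chunk, so any
    # leaked GPU memory is reclaimed before the next chunk starts.
    with Pool(processes=4, maxtasksperchild=1) as pool:
        parts = pool.map(embed_chunk, chunks)
    embeddings = [vec for part in parts for vec in part]
    # embeddings can now be merged into one Chroma collection.
```

Keeping the worker at module level matters: `Pool` pickles the function by reference, so nested or lambda workers would fail.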
Indeed, the bottleneck with Chroma.from_documents() arises because it computes embeddings sequentially rather than fully utilizing your GPU, and the method does not support parallel processing. A better approach is to pre-compute the embeddings in batches. I faced similar challenges with large datasets and found that switching to HuggingFace’s sentence-transformers and batching the embeddings significantly improved GPU utilization. Once you have computed embeddings for document chunks in parallel, you can supply them to Chroma directly through the underlying chromadb collection’s add() method (note that LangChain’s from_texts() expects an embedding function and will re-embed the texts itself). Alternatively, you might use threading.ThreadPoolExecutor for concurrent chunk processing, though be cautious about memory limits given your dataset size. The key takeaway is to avoid the default sequential processing in from_documents().
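A hedged sketch of the ThreadPoolExecutor variant: threads can help here because PyTorch releases the GIL while GPU kernels run, though actual gains depend on how much time is spent outside those kernels. The `encode` callable is a stand-in; in practice it would wrap a single SentenceTransformer shared across threads:

```python
from concurrent.futures import ThreadPoolExecutor

def embed_concurrently(chunks, encode, max_workers=4):
    """Encode chunks of texts on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # executor.map preserves the order of the input chunks
        parts = list(pool.map(encode, chunks))
    return [vec for part in parts for vec in part]

# Real usage would pass something like:
#   encode = lambda batch: model.encode(batch, batch_size=32).tolist()
# where model is one SentenceTransformer shared by all threads -- a single
# model keeps VRAM use bounded, unlike the one-model-per-process route.
```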
Everyone’s suggesting manual workarounds but you’re missing the bigger picture. This whole embedding workflow needs proper automation.
I hit this exact bottleneck processing thousands of legal documents monthly. Instead of fighting Chroma’s limitations or writing custom multiprocessing code, I built an automated pipeline that handles everything.
It’s not just about batching embeddings. You need orchestration that monitors GPU usage, splits workloads automatically, handles failures, and scales based on your hardware. Your 4090 can definitely handle multiple concurrent embedding jobs if managed right.
I created workflows that detect document types, route them to appropriate embedding models, batch them optimally for your specific GPU memory, and rebuild collections seamlessly. No manual batch size tuning or memory management headaches.
The key is having a system that adapts to your hardware automatically. GPU memory gets tight? It scales down batch sizes. Processing finishes? It immediately starts the next chunk. Zero manual intervention.
This approach cut our processing time from 6 hours to about 45 minutes for similar document volumes. Plus it handles errors gracefully and gives you real monitoring.
Check out how to build these automated embedding pipelines: https://latenode.com