Chroma Vector Database with Langchain Fails to Store Beyond 99 Embeddings

I’m working with a document processing pipeline that reads text files, splits them into smaller pieces, and stores embeddings in a Chroma vector database. My setup processes about 650 text chunks from a single document, but I keep hitting a weird limit where only 99 embeddings get saved.

import os
import shutil

# Import paths assume the langchain_community / langchain_openai split packages;
# adjust them to match your installed LangChain version.
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# SOURCE_DIR and DB_PATH are module-level path constants defined elsewhere.

def read_files():
    file_loader = DirectoryLoader(SOURCE_DIR, glob="*.txt")
    docs = file_loader.load()
    return docs

def chunk_content(docs: list[Document]):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=250,
        chunk_overlap=50,
        length_function=len,
        add_start_index=True,
    )
    text_chunks = splitter.split_documents(docs)
    print(f"Created {len(text_chunks)} chunks from {len(docs)} files.")
    
    sample = text_chunks[5]
    print(sample.page_content)
    print(sample.metadata)
    
    return text_chunks

def store_in_chroma(text_chunks: list[Document]):
    # Remove existing database
    if os.path.exists(DB_PATH):
        shutil.rmtree(DB_PATH)
    
    # Build new database from chunks
    vector_db = Chroma.from_documents(
        text_chunks, OpenAIEmbeddings(), persist_directory=DB_PATH
    )
    return vector_db

I tried processing chunks individually in a loop, but the same issue occurs. When I check the SQLite database directly, I can see exactly 99 records in the embeddings table, even though my code should be saving way more. Strangely, I can manually insert additional records using a database browser tool.

I’ve experimented with different approaches like processing smaller batches, keeping the existing database instead of clearing it, and adjusting chunk parameters, but nothing works. Has anyone encountered this 99-record limitation before?

You’re hitting Chroma’s batch processing limit - it defaults to 100 embeddings at once. I’ve run into this before, and it usually comes down to how the bulk insert works. Here’s what fixed it for me: modify your store_in_chroma function to handle smaller batches. Set up the Chroma database first, then use add_documents to insert maybe 50 documents at a time, with short pauses between batches to avoid throttling. Also watch your OpenAI API limits - calls can fail silently and you won’t notice until later. Add some logging to catch request failures, especially after that first batch processes.
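
Roughly what I ended up with - treat it as a sketch rather than drop-in code: the 50-document batch size and the one-second pause are arbitrary choices, DB_PATH is the constant from your question, and the import paths assume the langchain_community / langchain_openai packages:

import logging
import os
import shutil
import time

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

logging.basicConfig(level=logging.INFO)

def store_in_chroma(text_chunks, batch_size=50, pause_seconds=1.0):
    # Start from a clean database, as in your original function.
    if os.path.exists(DB_PATH):
        shutil.rmtree(DB_PATH)

    # Create the empty store first, then feed it in small batches.
    vector_db = Chroma(persist_directory=DB_PATH, embedding_function=OpenAIEmbeddings())

    for start in range(0, len(text_chunks), batch_size):
        batch = text_chunks[start:start + batch_size]
        try:
            vector_db.add_documents(batch)
            logging.info("Stored chunks %d-%d", start, start + len(batch) - 1)
        except Exception as exc:
            # Surface embedding/API failures instead of letting them pass silently.
            logging.error("Batch starting at chunk %d failed: %s", start, exc)
        time.sleep(pause_seconds)  # short pause to stay under API rate limits

    return vector_db

Creating the store first and adding documents in a loop also means a single failed batch only costs you those 50 chunks instead of the whole run.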

Been there, done that. Same exact thing happened to me last year building a document search system for our internal knowledge base.

It’s not just batch limits. Chroma has this weird quirk where it silently truncates at 99 records when there’s a mismatch between embedding dimensions and what it expects. Check if your OpenAI embeddings are generating consistent dimensions for all chunks.
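
If you want to check that quickly, embed a small sample and compare vector lengths - a minimal sketch, with the ten-chunk sample size picked arbitrarily:

from langchain_openai import OpenAIEmbeddings

# Embed a small sample and confirm every vector has the same length.
emb = OpenAIEmbeddings()
vectors = emb.embed_documents([chunk.page_content for chunk in text_chunks[:10]])
print("dimensions seen:", {len(v) for v in vectors})  # should be a single value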

What worked for me was adding explicit error handling and using add_texts instead of from_documents. Try this:

db = Chroma(persist_directory=DB_PATH, embedding_function=OpenAIEmbeddings())
texts = [chunk.page_content for chunk in text_chunks]
metadatas = [chunk.metadata for chunk in text_chunks]
try:
    db.add_texts(texts=texts, metadatas=metadatas)
except Exception as exc:
    print(f"add_texts failed: {exc}")

Run a quick test - try storing exactly 100 chunks first. If that works but 101 fails, you know it’s the batch limit. If 100 also gets truncated to 99, it’s definitely the dimension mismatch issue.
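
A quick way to run that test - this assumes a fresh database (a db built as above, before anything has been added) so earlier inserts don’t skew the count, and uses get(), which returns the stored ids among other fields:

# Store exactly 100 chunks into a fresh database, then count what actually landed.
test_chunks = text_chunks[:100]
db.add_texts(
    texts=[c.page_content for c in test_chunks],
    metadatas=[c.metadata for c in test_chunks],
)
stored = db.get()  # dict with "ids", "documents", "metadatas", ...
print(f"Expected {len(test_chunks)}, stored {len(stored['ids'])}")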

One more thing - check your SQLite database file permissions. Sometimes partial writes happen when the process doesn’t have full write access to the directory.
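
The permissions check itself is a couple of lines (DB_PATH being the same persist directory as in your question):

import os

print("exists:", os.path.exists(DB_PATH))
print("writable:", os.access(DB_PATH, os.W_OK))  # False here would explain partial writes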

Classic pagination issue with Chroma. I’ve hit this before with large datasets - the default query limit messes with bulk operations. Set an explicit batch_size parameter in your from_documents call. Also check if you’re hitting memory limits. Chroma quietly drops records when it can’t allocate enough memory for big embedding jobs. Don’t forget to check your OpenAI embedding rate limits too - failed API calls during bulk processing cause partial saves without obvious errors. I’d add a verification step after storing to compare actual vs expected record counts.
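
For that verification step, a minimal sketch - it assumes you return the vector_db object from store_in_chroma (or otherwise keep a handle on it) and uses get() to read back the stored ids:

def verify_store(vector_db, expected_count):
    # Compare what Chroma actually holds against the number of chunks we tried to store.
    stored_ids = vector_db.get()["ids"]
    if len(stored_ids) != expected_count:
        print(f"Mismatch: expected {expected_count} records, found {len(stored_ids)}")
    else:
        print(f"All {expected_count} records stored.")

# e.g. right after building the database:
# verify_store(vector_db, len(text_chunks))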