How to add new document embeddings to existing ChromaDB instance in Langchain

I built a QA system using Langchain with ChromaDB as my vector database. Right now it has embeddings from one document called “data.txt”. I want to add embeddings from another file “extra.txt” to the same ChromaDB instance without rebuilding everything from scratch.

Is there a way to keep my existing embeddings and just add new ones? I don’t want to process the first file again. I just need to expand my current vector database with additional document embeddings and use them all together for retrieval.

from langchain.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

file_loader = UnstructuredFileLoader('data.txt', mode='elements')
docs = file_loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=100)
chunks = splitter.split_documents(docs)
embedding_model = OpenAIEmbeddings()
vector_store = Chroma.from_documents(chunks, embedding_model)
qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0.1), chain_type="stuff", retriever=vector_store.as_retriever(search_type="mmr"), return_source_documents=True)

ChromaDB handles this perfectly. Just add new embeddings to your existing collection without messing with the old ones.

Use add_documents() instead of from_documents() for your new file. Here’s how:

# Load your new document
file_loader = UnstructuredFileLoader('extra.txt', mode='elements')
new_docs = file_loader.load()

# Split it the same way
splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=100)
new_chunks = splitter.split_documents(new_docs)

# Add to existing vector store
vector_store.add_documents(new_chunks)

I’ve done this tons of times when clients expand their knowledge base. Existing embeddings stay untouched - you’re just appending new ones.

Stick with the same embedding model and text splitter settings, or retrieval quality will suffer: vectors produced by different embedding models live in incompatible spaces, and mismatched chunk sizes skew relevance scoring.
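One way to enforce that consistency is to keep the ingestion settings in a single place and build the splitter from them every time. This is just an illustrative module layout, not anything LangChain requires:

```python
# Shared ingestion settings -- every file must be chunked and embedded
# identically, so define the values once and import them everywhere.
CHUNK_SIZE = 1200
CHUNK_OVERLAP = 100

def splitter_kwargs() -> dict:
    """Kwargs for RecursiveCharacterTextSplitter, identical for every file."""
    return {"chunk_size": CHUNK_SIZE, "chunk_overlap": CHUNK_OVERLAP}
```

Then both the original ingestion and later additions call `RecursiveCharacterTextSplitter(**splitter_kwargs())`, so the settings can't drift apart.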

Your QA chain picks up the expanded vector store automatically since it’s referencing the same ChromaDB instance.

Everyone’s missing the persistence issue. If you don’t set a persist_directory when creating your ChromaDB instance, your embeddings only live in memory and vanish when the process dies. I wasted hours adding documents just to lose everything on restart. Always specify a persist directory:

```python
vector_store = Chroma(persist_directory="./chroma_db", embedding_function=embedding_model)
```

I used to do this manually until I realized I was wasting tons of time on repetitive stuff.

The add_documents approach works, but you’ll hit scaling problems once you’re processing dozens of files regularly. Tracking which files you’ve already embedded becomes a nightmare.

I automated the whole thing and it’s been a game changer. Set up triggers to watch for new files, automatically process them through your chunking pipeline, and dump them into ChromaDB.

You can add validation to prevent duplicates, auto-cleanup when source files change, and parallel processing for multiple files.
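The "tracking which files you've already embedded" part doesn't need a framework. Here's a minimal stdlib sketch under the assumption that processed filenames are recorded in a small JSON state file; the names (`find_new_files`, the state-file layout) are illustrative, not from any library:

```python
import json
import os

def find_new_files(folder: str, state_path: str) -> list[str]:
    """Return files in `folder` not yet recorded in the JSON state file,
    then update the state. A stand-in for a real file watcher."""
    seen = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            seen = set(json.load(f))
    current = {e.name for e in os.scandir(folder) if e.is_file()}
    new = sorted(current - seen)
    with open(state_path, "w") as f:
        json.dump(sorted(current), f)
    return new
```

Run it on a schedule (or from a watcher callback), feed whatever it returns into your loader/splitter/`add_documents` pipeline, and already-embedded files are skipped automatically.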

I run similar pipelines for our knowledge management system. Once you automate file monitoring, text processing, and vector store updates, you’re done. Never think about it again.

Your QA chain keeps working exactly the same while your vector database grows automatically.

Check out how to build this automation: https://latenode.com

ChromaDB 0.4+ persists automatically when you call add_documents(); on older versions, call vector_store.persist() afterwards. Just make sure you’re connecting to your existing collection, not creating a new one.

I screwed this up initially by accidentally making a new collection instead of accessing the existing one. Use the same collection name and persist directory as your original setup.

# Connect to existing ChromaDB instance
vector_store = Chroma(persist_directory="your_persist_dir", embedding_function=OpenAIEmbeddings(), collection_name="your_collection_name")

# Process your new file
file_loader = UnstructuredFileLoader('extra.txt', mode='elements')
new_docs = file_loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=100)
new_chunks = splitter.split_documents(new_docs)

# Add to existing collection
vector_store.add_documents(new_chunks)

Retrieval quality stays solid if you use the same chunking parameters. I’ve added hundreds of docs this way without problems.

yeah, you can totally do this! just make sure your embedding models match. switching models from data.txt to extra.txt can mess up retrieval since the vector spaces won’t align. I had a headache upgrading models mid-way and had to reprocess everything. ChromaDB gives new chunks auto IDs, so no conflicts there.
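If you ever want control over those IDs, e.g. to make re-ingestion idempotent, you can derive them deterministically from the chunk content. `chunk_id` below is a hypothetical helper, and passing `ids` to `add_documents` is supported by the LangChain Chroma wrapper; whether a repeated ID overwrites or errors depends on your Chroma version, so treat this as a sketch:

```python
import hashlib

def chunk_id(chunk_text: str, source: str) -> str:
    """Deterministic ID from chunk content plus source file, so
    re-processing an unchanged file yields the same IDs again."""
    return hashlib.sha256(f"{source}:{chunk_text}".encode("utf-8")).hexdigest()

# Hypothetical usage with the vector store from above:
# ids = [chunk_id(c.page_content, c.metadata.get("source", "")) for c in new_chunks]
# vector_store.add_documents(new_chunks, ids=ids)
```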