I built a document processing pipeline that extracts key information from research papers. The system works perfectly on the first try but then starts giving wrong information on later attempts.
My Setup:
I’m using a vector database to store document chunks and an LLM to extract paper details like title, authors, summary, and publication date. The weird part is that the first execution always works correctly, but subsequent runs produce incorrect data even though the input file hasn’t changed.
Code Example:
import hashlib
from typing import List

def build_document_store(text_segments, embed_model, storage_path):
    # Create content-based identifiers
    segment_ids = [hashlib.md5(segment.page_content.encode()).hexdigest()
                   for segment in text_segments]

    # Remove duplicates
    seen_ids = set()
    filtered_segments = []
    for segment, seg_id in zip(text_segments, segment_ids):
        if seg_id not in seen_ids:
            seen_ids.add(seg_id)
            filtered_segments.append(segment)

    # Build vector store
    doc_store = ChromaDB.from_documents(
        documents=filtered_segments,
        ids=list(seen_ids),
        embedding=embed_model,
        persist_directory=storage_path
    )
    return doc_store
Output Structure:
from pydantic import BaseModel

class DocumentInfo(BaseModel):
    title: str
    abstract: str
    year: str
    authors: str
Why does my system give accurate results initially but then fail on repeated runs? How can I make it consistently reliable?
This is happening because of document ID conflicts in your vector store. When you regenerate the same MD5 hashes on repeat runs, ChromaDB thinks you’re updating existing documents instead of adding new ones. This messes up your embeddings and creates weird associations.
I hit this exact issue processing legal docs. Fixed it by managing collections properly - either delete the existing collection before rebuilding or add unique run identifiers to your document IDs.
Try adding a timestamp to your ID generation:
import time

run_id = str(int(time.time()))
segment_ids = [f"{run_id}_{hashlib.md5(segment.page_content.encode()).hexdigest()}"
               for segment in text_segments]
Or just delete the collection before each rebuild. First run works fine because the collection’s empty, but later runs get corrupted state from previous executions.
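One way to do that reset is to wipe the persist directory before rebuilding. A minimal sketch; reset_document_store is just an illustrative helper, and it assumes nothing else shares that directory:
import os
import shutil

def reset_document_store(storage_path):
    # Remove any previously persisted collection so the rebuild starts empty.
    if os.path.isdir(storage_path):
        shutil.rmtree(storage_path)

reset_document_store("./paper_store")  # example path; use your actual storage_path
doc_store = build_document_store(text_segments, embed_model, "./paper_store")
If your ChromaDB import is actually LangChain's Chroma wrapper, it also exposes a delete_collection() method if you'd rather keep the directory in place.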
Sounds like your DB is caching stuff. Try clearing the persist_directory or using a temp folder each time. I had the same issue with phi-embedding; old info messes up new runs!
Your vector similarity search is probably the culprit. ChromaDB returns different chunks each time you run it because of slight embedding variations or search parameter differences.
I’ve hit this same issue. Vector search isn’t deterministic - especially without a fixed seed or when similarity scores tie.
Make your retrieval consistent by pinning the query parameters. Note that search_kwargs belongs to the retriever interface, not to similarity_search itself, so pass k directly:
# When querying the vector store
results = doc_store.similarity_search(
    query=your_query,
    k=5,  # fixed number of results
)
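If you query through a retriever instead, you can pin the same parameters plus a relevance cutoff there. A minimal sketch, assuming the LangChain-style as_retriever API that your from_documents call suggests:
retriever = doc_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "k": 5,                  # fixed number of results
        "score_threshold": 0.7,  # consistent relevance cutoff
    },
)
context_docs = retriever.invoke(your_query)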
Double-check you’re loading the same persisted database every time. Recreating the ChromaDB instance instead of loading the existing one causes inconsistency.
One more thing - set temperature=0 in your model config to make LLM calls deterministic. Random sampling creates different outputs even with identical context.
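For example (a sketch assuming an OpenAI chat model through LangChain; substitute whatever client you actually call):
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",  # whichever model you use
    temperature=0,        # disable sampling randomness for extraction
)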
Check if you’re actually loading the persisted ChromaDB correctly on subsequent runs. I’ve seen this where the code looks like it’s loading from persist_directory but creates a fresh instance instead. First run works since it builds everything from scratch, but later runs might not find the persisted data.
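A minimal load-or-build sketch, assuming your ChromaDB import is the LangChain-style Chroma wrapper (its constructor takes persist_directory and embedding_function; adjust if yours differs):
import os

def load_or_build_store(text_segments, embed_model, storage_path):
    if os.path.isdir(storage_path) and os.listdir(storage_path):
        # Reopen the already-persisted collection instead of re-embedding everything.
        return ChromaDB(persist_directory=storage_path, embedding_function=embed_model)
    # First run: nothing persisted yet, so build from scratch.
    return build_document_store(text_segments, embed_model, storage_path)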
Another issue could be your document chunking - if you’re re-splitting the same document differently each time because of random elements in your text processing, you’ll get different chunks with different embeddings even from identical source files. Make sure your text splitting parameters are completely deterministic.
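For example, pin the splitter parameters explicitly (a sketch using the LangChain splitter; raw_documents stands in for however you load the paper):
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # fixed chunk size
    chunk_overlap=100,  # fixed overlap
)
text_segments = splitter.split_documents(raw_documents)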
Also check that your embedding model isn’t adding randomness. Some models have internal dropout or other random elements that give the same text slightly different embeddings across runs, which messes with your retrieval results.
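A quick sanity check, assuming embed_model follows the LangChain Embeddings interface (embed_query returns a list of floats):
# Embed the same text twice; the vectors should match exactly, or within float noise.
v1 = embed_model.embed_query("a sentence from the paper")
v2 = embed_model.embed_query("a sentence from the paper")
print(max(abs(a - b) for a, b in zip(v1, v2)))  # expect 0.0, or ~1e-7 at most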
Check your LLM prompt too - if it’s not structured properly, the model might pick up different context clues each time. Also make sure you’re not accidentally appending to existing ChromaDB collections instead of replacing them completely.