I found some documentation about creating a custom wrapper that adds scores to document metadata, but this seems odd to me. Why would retrieval scores become permanent metadata? I’m also confused about when to use invoke() versus similarity_search_with_score().
Another thing I’m struggling with is getting similarity scores from other retriever types like BM25Retriever and EnsembleRetriever. Do these support score retrieval in the same way as vector stores? Any guidance on the best approach would be helpful.
Here’s what I learned building retrieval systems at scale - you’ve got a classic interface mismatch problem.
You’re stuck manually juggling different APIs for each retriever type. Switch from FAISS to BM25 to ensemble methods? You need different code paths and score handling logic every time.
I hit this exact headache building a document search system that needed to compare results across multiple retrieval methods. Instead of fighting LangChain’s limitations, I automated the whole thing.
Set up a workflow that handles all retriever types through one interface. Configure different similarity thresholds, automatically normalize scores across methods, and combine results from multiple retrievers without writing custom ensemble logic.
The workflow pulls similarity scores whether you’re using vector similarity, BM25 term frequency, or hybrid approaches. No more switching between invoke() and similarity_search_with_score() or dealing with inconsistent score formats.
You get clean, structured outputs with documents and normalized scores, plus automatic fallback when certain retrievers don’t support scoring.
This scales way better than hardcoding different retriever APIs in your application code.
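To make that concrete, here’s a minimal sketch of the unified-interface idea. Everything here is hypothetical (the function name, the fallback behavior, the sign flip) and just illustrates one way to normalize scores across backends, assuming a LangChain-style object that may or may not have `similarity_search_with_score()`:

```python
def search_with_scores(backend, query, k=4):
    """Hypothetical unified search: return (doc, normalized_score) pairs.

    Assumes `backend` is either a vector store exposing
    similarity_search_with_score() or a plain retriever with invoke().
    """
    if hasattr(backend, "similarity_search_with_score"):
        pairs = backend.similarity_search_with_score(query, k=k)
        # FAISS-style stores return distances (lower = better), so flip
        # the sign to get a higher-is-better score before normalizing.
        scores = [-s for _, s in pairs]
    else:
        # Fallback for retrievers that don't support scoring at all:
        # return the documents with a uniform placeholder score.
        docs = backend.invoke(query)[:k]
        pairs = [(d, 0.0) for d in docs]
        scores = [0.0] * len(pairs)
    # Min-max normalize into [0, 1] so scores are comparable across methods.
    lo, hi = min(scores, default=0.0), max(scores, default=0.0)
    span = (hi - lo) or 1.0
    return [(doc, (s - lo) / span) for (doc, _), s in zip(pairs, scores)]
```

The sign flip and min-max normalization are the two choices doing the real work here: they let distance-based and similarity-based backends land on the same 0-to-1 scale.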
The invoke() vs similarity_search_with_score() confusion comes down to LangChain’s design choices. The Retriever interface tries to be universal, so it drops backend-specific features like similarity scores. Use invoke() and you get plain Documents without scores. But the vector store underneath still has those scores through its own methods. I’d stick with the retriever for simple fetches, then hit the vector store directly when you need scores. One caveat on reading FAISS scores: by default they’re L2 distances, not similarities, so lower numbers mean a better match (with normalized embeddings the distance falls between 0 and 2). Different vector stores score differently though, so watch out for normalization if you switch backends. Your metadata workaround might look weird, but it’s actually pretty smart for workflows where you’re passing scored results through retriever chains.
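You can see why the scores disappear with a toy stand-in. This is not the real FAISS or LangChain implementation, just a self-contained sketch where the method names mirror LangChain’s so the two code paths are side by side:

```python
import math

class ToyVectorStore:
    """Toy stand-in for a vector store; method names mirror LangChain's."""

    def __init__(self, docs, vectors):
        self.docs, self.vectors = docs, vectors

    def similarity_search_with_score(self, qvec, k=2):
        # L2 distance, like FAISS's default index: lower = more similar.
        scored = sorted(
            ((doc, math.dist(qvec, vec))
             for doc, vec in zip(self.docs, self.vectors)),
            key=lambda pair: pair[1],
        )
        return scored[:k]  # (doc, distance) tuples -- scores survive

    def retriever_invoke(self, qvec, k=2):
        # What the generic Retriever interface effectively does: run the
        # same search, then drop the score before returning Documents.
        return [doc for doc, _ in self.similarity_search_with_score(qvec, k)]
```

Same search both times; the retriever path just throws the score away at the boundary, which is why going straight to the store is the only way to keep it.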
i feel you on the frustration! yeah, invoke() won’t give you scores - super annoying. try db.similarity_search_with_score(query, k=4) instead. that’ll get you the (doc, score) tuples you need. bm25 won’t work that way though - BM25Retriever doesn’t expose its scores through the interface.
Yeah, the metadata confusion makes sense - adding scores that way does feel hacky. You’re hitting a limitation of the Retriever interface. It’s built to be generic, which means it can’t surface the specific features each method offers.

For vector stores like FAISS, skip the retriever completely. Call the similarity methods straight on the database object instead. Use similarity_search_with_score() - it returns tuples of documents and raw scores (for FAISS that’s an L2 distance by default, so lower means more similar, not a cosine similarity). There’s also max_marginal_relevance_search() if you want more diverse results.

BM25Retriever is different though. It uses term-frequency scoring, not vector similarity, and doesn’t expose its scores, so the same score-retrieval tricks won’t work. EnsembleRetriever is even messier since it mixes multiple methods and normalizes everything internally.

My advice: work directly with the vector store when you need scores. Only use the retriever interface when scores don’t matter for your app logic.
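If you really need BM25 scores, one workaround is to compute them yourself alongside the retriever. Here’s a minimal, self-contained sketch of the BM25 formula (the same family of scoring that BM25Retriever uses internally; the function name and default parameters here are my own choices, not a LangChain API):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Sketch: per-document BM25 scores for a tokenized query.

    `docs` is a list of raw strings; tokenization is naive whitespace
    splitting, just for illustration.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: how many docs contain each term.
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

This gives you a score per document that you can pair with whatever BM25Retriever returns, at the cost of tokenizing the same way the retriever does - which is the fiddly part in practice.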