How to retrieve document source information when querying a LangChain vector database

I’ve been working with the LangChain framework to store my organization’s data in a vector database. The search functionality works well and returns accurate results when I make queries. However, I’m struggling with one issue - I can’t figure out how to get the source information along with my search results.

What I need is to know where each piece of information originally came from. For example, I want to see something like source: “company-website.com/products” or even just a simple reference like “file_042”. Has anyone dealt with this before?

Right now I’m using vector_db.find_similar(search_term) for searches, which gives me the content but the metadata field comes back empty every time.

Here’s my current setup code:

vector_store = ElasticVectorSearch.from_documents(
    doc_list,
    embedding_model,
    elasticsearch_url="http://localhost:9200",
    index_name="company-data-index",
)

I’m open to changing my approach if needed. Any suggestions on how to preserve and retrieve source information would be really helpful.

Empty metadata happens when you don’t set it during document creation. Your doc_list probably doesn’t have the source info attached.

I hit this exact problem last year with a knowledge base project. Here’s what fixed it:

Make sure each document has metadata before feeding them to the vector store:

from langchain.schema import Document

docs_with_metadata = []
for doc in your_original_docs:
    doc_obj = Document(
        page_content=doc.content,
        metadata={"source": doc.source_url, "file_id": doc.file_name}
    )
    docs_with_metadata.append(doc_obj)

vector_store = ElasticVectorSearch.from_documents(
    docs_with_metadata,  # Use this instead
    embedding_model,
    elasticsearch_url="http://localhost:9200",
    index_name="company-data-index",
)

Then use similarity_search_with_score() instead of find_similar(). This returns both content and metadata:

results = vector_store.similarity_search_with_score(search_term, k=5)
for doc, score in results:
    print(f"Content: {doc.page_content}")
    print(f"Source: {doc.metadata.get('source', 'Unknown')}")

If you’re loading from files, most LangChain document loaders automatically add source metadata. For web scraping, you’ll need to add it manually like I showed above.
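To illustrate that pattern: file-based loaders attach the originating path under metadata["source"]. Here’s a stdlib-only sketch that mimics the loader behavior (this is not LangChain’s actual loader code, just the shape of what it produces, with plain dicts standing in for Document objects):

```python
import pathlib

def load_text_files(paths):
    """Mimic what LangChain's TextLoader does: one document per file,
    with the file path recorded under metadata['source']."""
    docs = []
    for path in paths:
        docs.append({
            "page_content": pathlib.Path(path).read_text(),
            "metadata": {"source": str(path)},
        })
    return docs
```

Anything you build by hand (scraped pages, API responses) just needs to follow the same shape before indexing.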

Hit this same issue a few weeks back. find_similar() isn’t part of LangChain’s standard vector store API - use similarity_search() instead. But first, check whether your docs actually have metadata before indexing. I spent hours debugging retrieval when my PDF loader wasn’t preserving source paths. Quick test: print one doc from doc_list and check if the metadata’s there.
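That quick test takes a few lines. A sketch of what to run before indexing (plain dicts stand in for LangChain Document objects here so the snippet runs on its own; real Documents expose .metadata instead of ["metadata"]):

```python
# Sanity-check docs before indexing: count how many lack a source.
doc_list = [
    {"page_content": "Widget specs...", "metadata": {"source": "site.com/products"}},
    {"page_content": "Pricing info...", "metadata": {}},  # missing source!
]

missing = [i for i, d in enumerate(doc_list) if not d["metadata"].get("source")]
print(f"{len(missing)} of {len(doc_list)} docs missing a source: indexes {missing}")
```

If anything shows up as missing here, no retrieval method will bring the source back later.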

Been there. Spent way too many hours debugging this exact metadata issue on a company project last year.

The real problem isn’t just retrieval - you’ve got multiple failure points. Document ingestion, metadata storage, and retrieval all need to work together. Most solutions here fix pieces but miss the bigger picture.

What I learned: this kind of data pipeline needs proper orchestration. You want to:

  1. Validate metadata exists before indexing
  2. Monitor your Elasticsearch storage
  3. Handle retrieval errors gracefully
  4. Maybe add some retry logic
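Step 1 is the cheapest to automate. A minimal validation gate might look like this (a sketch only - plain dicts stand in for Document objects, and the function and field names are my own, not from any library):

```python
def validate_metadata(docs, required_keys=("source",)):
    """Split docs into (valid, rejected) so incomplete ones never reach the index."""
    valid, rejected = [], []
    for doc in docs:
        meta = doc.get("metadata") or {}
        if all(meta.get(key) for key in required_keys):
            valid.append(doc)
        else:
            rejected.append(doc)
    return valid, rejected
```

Index only the valid list, and log or alert on the rejected one instead of silently dropping it.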

Instead of patching together different scripts and hoping everything works, I’d automate the whole pipeline. Set up workflows that handle document processing, validate metadata, store to vector DB, and give you proper error handling when things break.

I use Latenode for this automation. You can build workflows connecting your document sources directly to Langchain and Elasticsearch, with built-in monitoring and error handling. Way cleaner than managing all these moving parts manually.

Plus you get real visibility into where your pipeline breaks. When metadata goes missing, you’ll know exactly which step failed instead of debugging blind.

This metadata issue usually happens during document ingestion. I hit the same problem building a legal contract search system.

What caught me off guard was that metadata can get lost at the splitting step. If you’re using CharacterTextSplitter or RecursiveCharacterTextSplitter, use split_documents(), which copies each document’s metadata into every chunk - split_text() returns bare strings with no metadata at all:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True  # This helps track original position
)

# Split while keeping metadata
split_docs = splitter.split_documents(your_docs_with_metadata)

Check your Elasticsearch mapping too. Sometimes metadata fields don’t get indexed properly. Query your Elasticsearch index directly to see if the source info is actually stored.

What worked for me was adding a unique document ID to each chunk’s metadata, then keeping a separate lookup table that maps IDs to full source info. Gives you way more flexibility in tracking source details.
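A sketch of that ID-plus-lookup pattern (the names here are mine, not from any library): each chunk carries only a short ID in its metadata, and the full source record lives in a separate dict you can persist wherever you like:

```python
import uuid

source_lookup = {}  # doc_id -> full source record, kept outside the vector store

def register_chunk(chunk_text, source_info):
    """Tag a chunk with a unique ID and record its full source details separately."""
    doc_id = uuid.uuid4().hex[:12]
    source_lookup[doc_id] = source_info
    return {"page_content": chunk_text, "metadata": {"doc_id": doc_id}}

chunk = register_chunk(
    "Section 4.2: termination clauses...",
    {"url": "intranet/contracts/acme.pdf", "page": 12, "loaded_at": "2024-01-15"},
)

# Later, after retrieval, resolve the ID back to the full record:
full_source = source_lookup[chunk["metadata"]["doc_id"]]
```

This keeps the vector index lean and lets you change or enrich source records without re-embedding anything.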

The find_similar() method might not return metadata properly. I hit this issue while building a document retrieval system for technical specs - different vector stores handle metadata retrieval differently. For ElasticVectorSearch, try similarity_search(), which returns Document objects with their metadata attached, or similarity_search_with_relevance_scores() if you also want scores.

Also check your Elasticsearch config. When I had the same problem, metadata wasn’t being stored in the vector index at all. You can verify by querying Elasticsearch directly:

from elasticsearch import Elasticsearch

# The URL form works with elasticsearch-py 7.13+ and 8.x; the older
# [{'host': ..., 'port': ...}] dict form fails on newer clients.
es = Elasticsearch("http://localhost:9200")
result = es.search(index="company-data-index", body={"query": {"match_all": {}}}, size=1)
print(result['hits']['hits'][0]['_source'])

This shows exactly what’s stored in your index. If metadata’s missing there, the problem’s during document ingestion, not retrieval.