LangChain chatbot shows inconsistent behavior across different languages

I built a chatbot using LangChain for Swiss real estate domain knowledge. I indexed my own PDF documents and added conversation memory. The bot works fine in multiple languages, but I noticed something strange.

When I ask domain-specific questions in English, it answers correctly using the custom data. When I ask general-knowledge questions in English, it says “I don’t know”, which seems right since it should only use my own data. But when I ask the same general question in German (like “Was ist die Hauptstadt der Schweiz?”, i.e. “What is the capital of Switzerland?”), it suddenly gives the correct answer.

This leaves me unsure whether the bot is actually limited to my custom knowledge or whether it accesses pre-trained information depending on the language used.

My questions:

  • Is this expected behavior or a potential issue?
  • Can I force the chatbot to stick only to my custom data regardless of language?

I couldn’t find this scenario covered in the documentation.

Sample code:

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Load and process documents
loader = TextLoader('property_rules.txt')
documents = loader.load()

# Split text
splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50
)
texts = splitter.split_documents(documents)

# Create embeddings and vector store
embedder = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(texts, embedder)

# Set up the QA chain (the conversation memory mentioned above is omitted from this minimal sample)
llm = OpenAI(temperature=0.2)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# Test queries
response = qa_chain.run("What are tenant responsibilities?")
print(response)

I hit the same problem with a multilingual RAG system. It’s probably your retrieval config, not the language handling. OpenAI’s embeddings perform unevenly across languages, which skews how well your queries match document chunks. When retrieval can’t find good context, the LLM just falls back on what it already knows.

Set a minimum similarity threshold; note that score_threshold only takes effect with the matching search type: vectorstore.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.7}). If you’re working with multiple languages often, switch to a multilingual embedding model. You could also tweak the prompt so the chain checks whether the retrieved context actually relates to the question before answering.

This inconsistent behavior is very common in production; your retrieval pipeline just needs tuning. It’s not a LangChain bug.
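The threshold-plus-fallback logic is easy to prototype outside LangChain. Below is a minimal, runnable sketch (the scores, threshold, and helper names are illustrative, not LangChain APIs): retrieval results are modeled as (chunk, similarity) pairs, weak matches are dropped, and an empty result triggers one fixed fallback instead of letting the LLM improvise.

```python
# Sketch of score-threshold gating, independent of LangChain.
# Retrieval results are (chunk, similarity) pairs; anything below the
# threshold is dropped, and an empty result triggers a fixed fallback
# instead of letting the LLM answer from pre-trained knowledge.

FALLBACK = "I don't have information about this."

def filter_context(results, threshold=0.7):
    """Keep only chunks whose similarity meets the threshold."""
    return [chunk for chunk, score in results if score >= threshold]

def answer(results, llm_answer, threshold=0.7):
    context = filter_context(results, threshold)
    if not context:
        return FALLBACK            # identical response in every language
    return llm_answer(context)     # the LLM only ever sees strong context

# A general-knowledge query with only weak matches never reaches the
# LLM, so it cannot fall back on pre-trained knowledge:
weak = [("tenant rules ...", 0.42), ("deposit rules ...", 0.35)]
print(answer(weak, lambda ctx: "(LLM answer)"))  # prints the fallback
```

The right threshold depends on the embedding model and distance metric, so treat 0.7 as a starting point, not a constant.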

Yeah, I’ve hit this exact problem before. Your embeddings are language-dependent, so retrieval quality tanks across languages.

Here’s what’s happening: English queries get good similarity scores and pull relevant context, so the model uses your data. German queries retrieve garbage or nothing useful, so the LLM ignores the weak context and falls back on training data.

I fixed it by switching to multilingual embeddings first. OpenAI’s default ones work but aren’t great for cross-language stuff. I also added a context relevance check - “Only answer if the retrieved context directly addresses the question. Otherwise say ‘I don’t have information about this.’”
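The relevance check doesn’t have to live only in the prompt. As a sketch of the same idea in code (a crude, hypothetical word-overlap heuristic, not a LangChain feature), you can refuse to call the LLM at all when the retrieved context shares too little vocabulary with the question:

```python
# Crude relevance-check sketch: before answering, verify the retrieved
# context shares enough content words with the question; otherwise emit
# the same fallback in every language. Hypothetical stand-in for the
# prompt-based check described above.

FALLBACK = "I don't have information about this."

def overlap(question, context):
    """Fraction of the question's content words that appear in the context."""
    q = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}
    c = {w.lower().strip("?.,") for w in context.split()}
    return len(q & c) / max(len(q), 1)

def guarded_answer(question, context, llm_answer, min_overlap=0.2):
    if overlap(question, context) < min_overlap:
        return FALLBACK            # unrelated context: refuse uniformly
    return llm_answer(question, context)
```

A real relevance check would use embeddings or an LLM judge rather than word overlap, but the control flow is the same: gate the answer before generation, not after.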

Logging retrieval scores during development helped too. Add some debug prints to see similarity scores for the same questions in different languages. You’ll spot the pattern fast.
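A sketch of that debug logging, with the vector store stubbed out. In real code you would call Chroma’s similarity_search_with_score, which returns (Document, score) pairs; note that depending on the store and distance metric the number can be a distance (lower is better) rather than a similarity (higher is better), so check the direction before choosing thresholds. The queries and scores below are made up to show the pattern:

```python
def search_with_score(query):
    # Stubbed results; in real code use
    # vectorstore.similarity_search_with_score(query) instead.
    fake_index = {
        "What is the capital of Switzerland?": [("chunk A", 0.31), ("chunk B", 0.22)],
        "Was ist die Hauptstadt der Schweiz?": [("chunk A", 0.12), ("chunk B", 0.08)],
    }
    return fake_index[query]

def top_scores(queries):
    """Print and return the best score per query so language gaps are visible."""
    report = {}
    for q in queries:
        best = max(s for _, s in search_with_score(q))
        report[q] = best
        print(f"{q!r}: best score {best:.2f}")
    return report

top_scores([
    "What is the capital of Switzerland?",
    "Was ist die Hauptstadt der Schweiz?",
])
```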

Quick fix: raise your similarity threshold and make your fallback response identical across languages.


This inconsistency is super common in production multilingual systems. Fixable but needs tuning.

This happens because your retrieval system can’t find relevant documents for general queries, so the LLM just uses its built-in knowledge instead. The language gap likely comes down to how embedding similarity search works: German queries land in a different region of the embedding space, which lowers retrieval scores.

To fix this, update your chain’s prompt template to strictly limit answers to the retrieved context. Add an instruction like ‘Answer only from the provided context. If there is no relevant information, say you don’t have information about this topic.’ You can also raise the retriever’s similarity threshold. You’re already on the ‘stuff’ chain type, so what’s missing is the stricter prompt: RetrievalQA accepts a custom prompt template through the chain_type_kwargs parameter. Forcing the model to rely on your documents instead of its training data will make the behavior consistent across languages.
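For the strict prompt, here is a sketch of the template text. The {context} and {question} placeholders follow the convention the “stuff” RetrievalQA chain uses; the exact wording is up to you. In LangChain you would wrap this in a PromptTemplate and pass it via chain_type_kwargs={"prompt": ...}; below, plain str.format stands in so the filled prompt is easy to inspect:

```python
# Strict prompt template; {context} and {question} are the variables
# the chain substitutes at query time.
STRICT_TEMPLATE = """Answer using ONLY the context below.
If the context does not contain the answer, reply exactly:
"I don't have information about this topic."
Answer in the same language as the question.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(context, question):
    """Fill the template the same way the chain would."""
    return STRICT_TEMPLATE.format(context=context, question=question)

print(build_prompt("Tenants must return all keys on move-out.",
                   "Was ist die Hauptstadt der Schweiz?"))
```

Putting the refusal sentence verbatim in the template also gives you an identical fallback string in every language, which makes the behavior easy to test.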

I’ve hit this multilingual mess tons of times. Your retrieval pipeline breaks differently for each language, but there’s a better way than constantly tweaking thresholds and prompts.

Skip the LangChain retrieval headaches - I automated this whole thing with Latenode. Built a flow that preprocesses queries in any language, standardizes them for consistent embedding matching, and sets strict context boundaries automatically.

It detects the language, translates queries into your main index language for better retrieval, then translates responses back, which sidesteps the embedding mismatch. You can also set hard rules to block any response that doesn’t come from your documents.

I threw in monitoring that tracks retrieval scores and flags potential hallucinations. The whole system runs itself - no more babysitting similarity thresholds or prompt engineering.

Your code works, but you’ll spend a long time chasing edge cases. Automating the translation step solves the language inconsistency once and scales to new languages automatically.