LangChain throws ValueError about missing input keys including '_id' field

I’m working on a MongoDB query generator using LangChain and OpenAI’s GPT model but running into issues. Here’s my code:

import json
import re

import pymongo
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

openai_llm = ChatOpenAI(openai_api_key=API_KEY, model='gpt-3.5-turbo-0125')
database_schema = AppConfig.DB_SCHEMA
schema_info = AppConfig.SCHEMA_INFO
example_json = AppConfig.SAMPLE_OUTPUT
example_string = json.dumps(example_json)

query_prompt_template = """
You are a MongoDB expert with extensive experience in NoSQL database queries.
I'll give you a database_schema and schema_info in a specific format.
Your job is to read the user_input and generate a MongoDB aggregation pipeline.

Database Schema: """ + database_schema + """
Schema Info: """ + schema_info + """

Example:
Input: Get all films by Christopher Nolan with actor count per film
Output: {sample_output}

Return only the query without extra text.
Input: {user_input}
"""

query_prompt = PromptTemplate(
    template=query_prompt_template,
    input_variables=["user_input", "sample_output"],
)

chain = LLMChain(llm=openai_llm, prompt=query_prompt, verbose=True)

def generate_query(user_input):
    result = chain.invoke({
        "user_input": user_input,
        "sample_output": example_string
    })
    
    clean_result = result['text'].replace("Output: ", "")
    regex_pattern = r'db\.collectionName\.aggregate\(\s*\['
    cleaned_output = re.sub(regex_pattern, '', clean_result)
    
    return json.loads(cleaned_output)

mongo_client = pymongo.MongoClient(AppConfig.CONNECTION_STRING)
database = mongo_client[AppConfig.DATABASE_NAME]
collection = database[AppConfig.COLLECTION_NAME]
test_query = generate_query(user_input="what year was TechCorp established?")

The error I get is:

ValueError: Missing some input keys: {'\n        "_id"'}

I’m trying to build a system that creates MongoDB aggregation pipelines automatically based on natural language questions. The PromptTemplate should work with the input variables I’m providing, but something seems wrong with the key matching. Any ideas what could be causing this mismatch?

Hit this exact ValueError last month building a MongoDB query system. It’s definitely the template variable parsing, but here’s a cleaner angle on fixing it.

Your PromptTemplate declares input_variables=["user_input", "sample_output"] but your template has hardcoded schema info with JSON objects full of curly braces. LangChain scans the whole template string for brace patterns no matter where they are.

Cleanest fix I found was partial_variables for static content:

query_prompt = PromptTemplate(
    template=query_prompt_template,
    input_variables=["user_input", "sample_output"],
    partial_variables={
        "static_schema": database_schema,
        "static_info": schema_info
    }
)

Just reference {static_schema} and {static_info} in your template instead of concatenating. This keeps the problematic JSON away from the template parser while your invoke calls stay simple. Way better than manually escaping everything, especially when your schema changes a lot.
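For this to work, the template itself has to drop the string concatenation and use the two partial names (static_schema and static_info here are just the names from my snippet above). A sketch, checked with the stdlib Formatter, which is the same machinery f-string templates are parsed with:

```python
from string import Formatter

query_prompt_template = """
You are a MongoDB expert with extensive experience in NoSQL database queries.
I'll give you a database_schema and schema_info in a specific format.
Your job is to read the user_input and generate a MongoDB aggregation pipeline.

Database Schema: {static_schema}
Schema Info: {static_info}

Example:
Input: Get all films by Christopher Nolan with actor count per film
Output: {sample_output}

Return only the query without extra text.
Input: {user_input}
"""

# The parser now only ever sees four clean placeholders; the brace-laden
# schema JSON is injected later as a partial value and never re-parsed.
found = sorted({name for _, name, _, _ in Formatter().parse(query_prompt_template) if name})
print(found)  # ['sample_output', 'static_info', 'static_schema', 'user_input']
```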

This happens because LangChain thinks it found an input variable when it hits the "_id" field in your database schema or example data.

LangChain uses curly braces {} to mark template variables. Because you concatenate database_schema and schema_info straight into the template string, any JSON object inside them, like {"_id": ...}, gets parsed as a placeholder, and everything before the first colon becomes the "variable" name. That’s exactly where the odd key '\n        "_id"' in your error comes from.
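You can watch the parser do this with the stdlib string.Formatter, which LangChain’s f-string template parsing is built on (the two-field schema fragment below is made up, shaped to mirror your error):

```python
from string import Formatter

# A template with a JSON schema concatenated straight in, like the original code
template = '''Database Schema: {
        "_id": "ObjectId",
        "title": "string"}
Input: {user_input}'''

# Everything between '{' and the first ':' is taken as a variable name
detected = [name for _, name, _, _ in Formatter().parse(template) if name]
print(detected)  # ['\n        "_id"', 'user_input'] -- the exact bogus key from the error
```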

I hit this exact issue last year building a similar query generator. You need to either escape the curly braces in your static content or change how you pass the data.

Here’s what worked for me:

query_prompt_template = """
You are a MongoDB expert with extensive experience in NoSQL database queries.
I'll give you a database_schema and schema_info in a specific format.
Your job is to read the user_input and generate a MongoDB aggregation pipeline.

Database Schema: {database_schema}
Schema Info: {schema_info}

Example:
Input: Get all films by Christopher Nolan with actor count per film
Output: {sample_output}

Return only the query without extra text.
Input: {user_input}
"""

query_prompt = PromptTemplate(
    template=query_prompt_template,
    input_variables=["user_input", "sample_output", "database_schema", "schema_info"],
)

Then update your invoke call:

result = chain.invoke({
    "user_input": user_input,
    "sample_output": example_string,
    "database_schema": database_schema,
    "schema_info": schema_info
})

This stops LangChain from trying to parse your schema content for template variables.
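The reason this works: str.format (and LangChain’s f-string templates on top of it) substitutes values without re-scanning them, so braces inside a variable’s value never reach the parser. A stdlib-only illustration, with a made-up one-line schema:

```python
schema = '{"_id": "ObjectId", "year": "int"}'  # full of braces, but passed as a value
template = "Database Schema: {database_schema}\nInput: {user_input}"

# Only the template is parsed for placeholders; the value lands verbatim
result = template.format(database_schema=schema,
                         user_input="what year was TechCorp established?")
print(result)
```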

LangChain’s template parser sees curly braces in your concatenated strings and thinks they’re input variables. When you concatenate database_schema, schema_info, and other content directly into the template, any JSON objects or MongoDB field references with braces get parsed as template variables.

I hit the same thing working with database schemas that had JSON examples. The parser found patterns like {"_id": ObjectId(...)} in my schema docs and treated them as missing variables.

The fix: escape the problem braces in your static content by doubling them up. If your schema has {"_id"}, change it to {{"_id"}}. This tells LangChain’s parser they’re literal characters, not variable placeholders.

You could also ditch the direct string concatenation for proper template variables as mentioned above, but if you want a quick fix without restructuring, escaping should solve it.
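A quick sanity check of the doubling trick, using only the stdlib (str.format unescapes doubled braces exactly the way LangChain’s f-string templates do; the schema here is a made-up two-field example):

```python
import json

schema = {"_id": "ObjectId", "title": "string"}
schema_str = json.dumps(schema)

# Double every brace so the template parser treats them as literals
escaped = schema_str.replace("{", "{{").replace("}", "}}")

template = "Schema: " + escaped + "\nInput: {user_input}"
result = template.format(user_input="hello")
print(result)  # the doubled braces come back as single literal braces
```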

Been fighting LangChain templating for months too. Yeah, it’s exactly what everyone said - MongoDB’s curly braces mess with the template variables.

But you’re building something that’ll break every time your schema changes. Add one field with braces and you’re back to escaping characters again.

I ditched LangChain for this and used Latenode instead. My flow:

  • User input comes through webhook
  • Send it to OpenAI with schema as separate parameters (no template parsing)
  • Strip extra text with a JS node
  • Test the query directly in MongoDB

No more brace conflicts. No more escaping every schema change. Everything runs automatically and I can tweak parts without breaking the prompt.

Took 30 minutes to build what I’d been debugging for days. Better error handling too, plus easy validation and query caching.