I’m working on a MongoDB query generator using LangChain and OpenAI’s GPT model but running into issues. Here’s my code:
openai_llm = ChatOpenAI(openai_api_key=API_KEY, model='gpt-3.5-turbo-0125')
database_schema = AppConfig.DB_SCHEMA
schema_info = AppConfig.SCHEMA_INFO
example_json = AppConfig.SAMPLE_OUTPUT
example_string = json.dumps(example_json)
query_prompt_template = """
You are a MongoDB expert with extensive experience in NoSQL database queries.
I'll give you a database_schema and schema_info in a specific format.
Your job is to read the user_input and generate a MongoDB aggregation pipeline.
Database Schema: """ + database_schema + """
Schema Info: """ + schema_info + """
Example:
Input: Get all films by Christopher Nolan with actor count per film
Output: {sample_output}
Return only the query without extra text.
Input: {user_input}
"""
query_prompt = PromptTemplate(
template=query_prompt_template,
input_variables=["user_input", "sample_output"],
)
chain = LLMChain(llm=openai_llm, prompt=query_prompt, verbose=True)
def generate_query(user_input):
result = chain.invoke({
"user_input": user_input,
"sample_output": example_string
})
clean_result = result['text'].replace("Output: ", "")
regex_pattern = r'db\.collectionName\.aggregate\(\s*\['
cleaned_output = re.sub(regex_pattern, '', clean_result)
return json.loads(cleaned_output)
mongo_client = pymongo.MongoClient(AppConfig.CONNECTION_STRING)
database = mongo_client[AppConfig.DATABASE_NAME]
collection = database[AppConfig.COLLECTION_NAME]
test_query = generate_query(user_input="what year was TechCorp established?")
The error I get is:
ValueError: Missing some input keys: {'\n "_id"'}
I’m trying to build a system that creates MongoDB aggregation pipelines automatically based on natural language questions. The PromptTemplate should work with the input variables I’m providing, but something seems wrong with the key matching. Any ideas what could be causing this mismatch?
Hit this exact ValueError last month building a MongoDB query system. It’s definitely the template variable parsing, but there’s another angle.
Your PromptTemplate declares input_variables=["user_input", "sample_output"] but your template has hardcoded schema info with JSON objects full of curly braces. LangChain scans the whole template string for brace patterns no matter where they are.
Cleanest fix I found was partial_variables for static content:
Just reference {static_schema} and {static_info} in your template instead of concatenating. This keeps the problematic JSON away from the template parser while your invoke calls stay simple. Way better than manually escaping everything, especially when your schema changes a lot.
This happens because LangChain thinks it found an input variable when it sees the "_id" field in your database schema or example data.
LangChain uses curly braces {} to spot template variables. Somewhere in your database_schema, schema_info, or example_json, you’ve got MongoDB field names with underscores wrapped in quotes.
I hit this exact issue last year building a similar query generator. You need to either escape the curly braces in your static content or change how you pass the data.
Here’s what worked for me:
query_prompt_template = """
You are a MongoDB expert with extensive experience in NoSQL database queries.
I'll give you a database_schema and schema_info in a specific format.
Your job is to read the user_input and generate a MongoDB aggregation pipeline.
Database Schema: {database_schema}
Schema Info: {schema_info}
Example:
Input: Get all films by Christopher Nolan with actor count per film
Output: {sample_output}
Return only the query without extra text.
Input: {user_input}
"""
query_prompt = PromptTemplate(
template=query_prompt_template,
input_variables=["user_input", "sample_output", "database_schema", "schema_info"],
)
LangChain’s template parser sees curly braces in your concatenated strings and thinks they’re input variables. When you concat database_schema, schema_info, and other content directly into the template, any JSON objects or MongoDB field references with braces get parsed as template variables. I hit the same thing working with database schemas that had JSON examples. The parser found patterns like {"_id": ObjectId(...)} in my schema docs and treated them as missing variables. Escape the problem braces in your static content by doubling them up. If your schema has {"_id"}, change it to {{"_id"}}. This tells LangChain’s parser they’re literal characters, not variable placeholders. You could also ditch the direct string concatenation for proper template variables like mentioned above, but if you want a quick fix without restructuring, escaping should solve it.