I’ve built a LangChain AI agent that queries pandas DataFrames and creates plotly visualizations. The agent gets detailed instructions through pre-prompts, along with chat history and table metadata. Most of the time it works great, but I’m having issues with inconsistent query results.
The main problem happens when users search for countries using different formats. For example, when someone looks for Puerto Rico, the agent should find “PRI” in the data, but sometimes it searches for “PR” instead and returns nothing. Same thing with Mexico - users might type “Mexico”, “MEX”, or “MX”, but the agent doesn’t always handle these variations correctly.
I’ve already included column metadata for all three datasets in my pre-prompt. The agent has memory and access to chat history. I set the temperature to 0 for consistency. The capitalization issue was mostly fixed by adding instructions to the pre-prompt.
I don’t want to hardcode every possible country name variation. Is there a better way to make the agent more flexible with user input while keeping the queries accurate?
I’d build a similarity-based matching tool right into your agent’s workflow. In my setup, a custom tool calculates string similarity scores between the user input and all country identifiers in the DataFrame. When the agent hits a country search, it calls this similarity function first to find close matches above a threshold. The key difference is that this happens during the agent’s reasoning process, not as preprocessing: the agent decides when to invoke the similarity tool for an ambiguous country name, which keeps the LangChain agent flexible while fixing the inconsistency. I use rapidfuzz because it’s faster than fuzzywuzzy and handles DataFrame lookups efficiently. The agent then picks the best match based on context and similarity scores, making it smarter about resolving country names.
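Here’s a minimal sketch of what that tool looks like; the DataFrame, column name, and score cutoff are illustrative stand-ins for your setup:

```python
import pandas as pd
from rapidfuzz import process, fuzz, utils
from langchain_core.tools import tool

# Stand-in data; in practice this is the DataFrame the agent queries.
df = pd.DataFrame({"country": ["PRI", "MEX", "USA", "CAN"]})

@tool
def match_country(query: str) -> str:
    """Find country identifiers in the data that closely match the user's wording."""
    choices = df["country"].dropna().unique().tolist()
    # default_process lowercases and strips, so "Mexico" can still hit "MEX";
    # process.extract yields (match, score, index) tuples above the cutoff.
    matches = process.extract(query, choices, scorer=fuzz.WRatio,
                              processor=utils.default_process,
                              score_cutoff=70, limit=5)
    if not matches:
        return f"No close matches for '{query}'."
    return "Close matches: " + ", ".join(f"{m} ({s:.0f})" for m, s, _ in matches)
```

Register it like any other tool and tell the agent in the pre-prompt to resolve country names through it before filtering.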
Add a validation step - if the query comes back empty, have the agent automatically try different country variations from your DataFrame. So when ‘PR’ doesn’t work for Puerto Rico, it’ll just try ‘PRI’ next without you having to do anything.
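A rough sketch of that retry, using stdlib difflib to pull the variations from the column itself (the function and column names are placeholders):

```python
import difflib
import pandas as pd

def query_with_fallback(df: pd.DataFrame, column: str, value: str) -> pd.DataFrame:
    """Exact match first; on an empty result, retry with close variants from the data."""
    result = df[df[column] == value]
    if not result.empty:
        return result
    # Empty result: try near matches among the values actually in the column.
    choices = df[column].dropna().unique().tolist()
    for variant in difflib.get_close_matches(value, choices, n=3, cutoff=0.6):
        retry = df[df[column] == variant]
        if not retry.empty:
            return retry
    return result  # still empty - nothing matched
```

With the default cutoff, ‘PR’ scores 0.8 against ‘PRI’, so the retry finds it.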
I had the same issue with my DataFrame agent project. Here’s what worked for me: Create a mapping dictionary that gets referenced during searches. Don’t hardcode everything - build it dynamically by extracting unique country codes from your data and creating common aliases programmatically. Try using pycountry to generate standard mappings between country names and ISO codes. Have your agent check this mapping before running queries. I also used fuzzy string matching with fuzzywuzzy to catch close-but-not-exact matches. The trick is preprocessing user input through these mapping/matching steps before it hits your DataFrame query logic.
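Here’s roughly how the dynamic mapping comes together with pycountry; this sketch assumes your DataFrame stores ISO alpha-3 codes in a "country" column (adjust if yours stores full names):

```python
import pandas as pd
import pycountry

df = pd.DataFrame({"country": ["PRI", "MEX", "USA"]})  # stand-in for your data

def build_country_aliases(codes_in_data: set) -> dict:
    """Map lowercase names and ISO codes to the alpha-3 codes present in the data."""
    aliases = {}
    for country in pycountry.countries:
        if country.alpha_3 in codes_in_data:
            aliases[country.name.lower()] = country.alpha_3     # "mexico" -> "MEX"
            aliases[country.alpha_2.lower()] = country.alpha_3  # "mx"     -> "MEX"
            aliases[country.alpha_3.lower()] = country.alpha_3  # "mex"    -> "MEX"
    return aliases

aliases = build_country_aliases(set(df["country"].unique()))
print(aliases.get("puerto rico"))  # PRI
print(aliases.get("mx"))           # MEX
```

Anything the mapping misses then falls through to the fuzzy matcher.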
The problem is you’re expecting the LLM to guess country codes. I’ve hit this same wall multiple times.
Here’s what works: add a preprocessing step before the agent sees anything. Build a lookup function that scans user input and returns actual matches from your data.
First, pull every unique country identifier from your DataFrame. Then create a reverse lookup - “Puerto Rico” maps to whatever codes you actually have (“PRI”, “PR”, “Puerto Rico”, and so on).
Make it dynamic. User types “Mexico”? Your preprocessor spits back “Found matches: MEX, MX, Mexico” and lets the agent pick what exists in that specific DataFrame.
I set this up as a tool the agent can call. Instead of guessing, it asks “what country codes are available for Mexico?” and gets real answers from your data.
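A sketch of that tool, with illustrative names; I generate the spelling variants with pycountry here, but any reverse-lookup source works:

```python
import pandas as pd
import pycountry
from langchain_core.tools import tool

df = pd.DataFrame({"country": ["PRI", "MEX", "USA"]})  # stand-in for your data

def variants(code: str) -> set:
    """Every spelling we know for one identifier: itself plus its ISO name/codes."""
    names = {code}
    try:
        c = pycountry.countries.lookup(code)
        names.update({c.name, c.alpha_2, c.alpha_3})
    except LookupError:
        pass  # non-ISO identifiers keep only their literal spelling
    return {n.lower() for n in names}

@tool
def available_country_codes(user_country: str) -> str:
    """Return the country identifiers that exist in the data for the user's wording."""
    query = user_country.strip().lower()
    hits = [c for c in df["country"].dropna().unique() if query in variants(c)]
    if hits:
        return "Found matches: " + ", ".join(hits)
    return f"No identifier in the data matches '{user_country}'."
```

So "Mexico" comes back as "Found matches: MEX", because MEX is the only Mexican identifier this particular DataFrame actually has.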
This saved me weeks of debugging. The agent stops making up codes and uses actual ones.
normalize inputs! like, lowercase both the user input and the column values before you search. it’s a simple fix that could help tackle those inconsistency issues. good luck, hope it works out!
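something like this, assuming a "country" column:

```python
import pandas as pd

df = pd.DataFrame({"country": ["PRI", "MEX"]})  # stand-in for your data
user_input = "pri"

# lowercase both sides so "PRI", "pri", "Pri" all hit the same rows
matches = df[df["country"].str.lower() == user_input.strip().lower()]
```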