How to integrate RAGAS metrics with Langsmith for automated RAG chatbot assessment

Hi everyone! I built a RAG chatbot using LangChain 0.2 and Chainlit, and now I want to set up proper evaluation of the LLM outputs. I'm completely new to evaluation frameworks and feeling a bit lost about where to start.

From what I’ve read, RAGAS offers various evaluation metrics while Langsmith provides nice dashboards for viewing results. My main questions are:

  1. What’s the best way to connect these two tools together?
  2. Since my bot is already live in production, how do I set up automatic evaluation without disrupting users?
  3. Has anyone here successfully implemented this kind of setup before?

I’d really appreciate any guidance, code examples, or tips from your experience. Thanks for your help!

Been running this exact setup for 8 months. Skip webhooks - they get messy when you scale.

I built a middleware layer that logs everything to Langsmith and a separate evaluation queue. The key is treating evaluation as its own pipeline, completely separate from your live chatbot.

Here’s what I do: every conversation gets tagged in Langsmith with metadata (user_id, conversation_type, etc). A cron job pulls conversations from Langsmith API every few hours and runs them through RAGAS evaluation.
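A rough sketch of that pull step, assuming the langsmith Python SDK; the project name and metadata keys below are placeholders for whatever you tag at trace time:

```python
from datetime import datetime, timedelta, timezone

from langsmith import Client

client = Client()  # reads the LangSmith API key from the environment

# Pull the top-level chain runs from the last few hours (the window your cron job covers).
runs = list(client.list_runs(
    project_name="my-chatbot",            # placeholder: your tracing project
    run_type="chain",
    start_time=datetime.now(timezone.utc) - timedelta(hours=4),
))

for run in runs:
    # Metadata tagged at trace time lives under run.extra; the keys are whatever you set.
    metadata = (run.extra or {}).get("metadata", {})
    print(run.id, metadata.get("user_id"), metadata.get("conversation_type"))
```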

Use Langsmith’s dataset feature. Create datasets from your traced conversations, then run RAGAS metrics on those datasets. Clean separation between live traffic and evaluation.
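If you go the dataset route, a minimal version looks something like this (the dataset name is a placeholder, and the shape of `run.inputs` / `run.outputs` depends on how your chain is traced):

```python
from langsmith import Client

client = Client()

# Create a LangSmith dataset to hold sampled production conversations for offline scoring.
dataset = client.create_dataset(
    dataset_name="rag-eval-batch",        # placeholder name
    description="Sampled production conversations for RAGAS scoring",
)

for run in runs:  # `runs` pulled via client.list_runs, as in the snippet above
    client.create_example(
        inputs=run.inputs,                # e.g. {"question": "..."}
        outputs=run.outputs,              # e.g. {"answer": "...", "contexts": [...]}
        dataset_id=dataset.id,
    )
```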

For production stability, I evaluate 5% of conversations immediately and batch the rest overnight. Start with context precision and answer correctness - they catch the most real issues.
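The 5%-now / rest-overnight split can be as simple as a random gate when a conversation finishes; this sketch assumes a hypothetical `score_now()` helper and a plain text file as the overnight backlog:

```python
import random

SAMPLE_RATE = 0.05  # evaluate roughly 5% of conversations immediately

def handle_finished_conversation(run_id: str) -> None:
    if random.random() < SAMPLE_RATE:
        score_now(run_id)  # hypothetical: run RAGAS on this single trace right away
    else:
        # Everything else goes into the overnight batch job.
        with open("overnight_eval_backlog.txt", "a") as f:
            f.write(f"{run_id}\n")
```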


One gotcha: make sure your retrieval context is properly serialized in Langsmith traces, or RAGAS will fail on half your evaluations.
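One way to avoid that is to make the retrieved chunks part of the traced output yourself, e.g. with the langsmith `@traceable` decorator; the retriever and helper names here are placeholders for your own code:

```python
from langsmith import traceable

@traceable(run_type="chain", name="rag_answer")
def rag_answer(question: str) -> dict:
    docs = retriever.invoke(question)          # placeholder: your retriever
    answer = generate_answer(question, docs)   # placeholder: your LLM call
    # Return contexts as plain strings so the trace (and later RAGAS) gets clean text,
    # not un-serializable Document objects.
    return {
        "answer": answer,
        "contexts": [d.page_content for d in docs],
    }
```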

The data format mismatch is the biggest headache. Langsmith traces don't play nice with RAGAS out of the box, so you'll need preprocessing. I run a Python script hourly that pulls traces and reformats them for RAGAS. Start with just the faithfulness metric.
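A hedged sketch of that hourly reformatting step, mapping trace fields into the column names RAGAS expects (`question` / `answer` / `contexts`); the input and output keys depend entirely on how your chain is traced:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

rows = {"question": [], "answer": [], "contexts": []}

for run in runs:  # traces pulled from the LangSmith API, as in the earlier snippets
    outputs = run.outputs or {}
    if not outputs.get("answer") or not outputs.get("contexts"):
        continue  # skip traces where the retrieval context wasn't serialized properly
    rows["question"].append(run.inputs.get("question", ""))
    rows["answer"].append(outputs["answer"])
    rows["contexts"].append([str(c) for c in outputs["contexts"]])

ragas_dataset = Dataset.from_dict(rows)
result = evaluate(ragas_dataset, metrics=[faithfulness])
print(result)  # aggregate score; result.to_pandas() gives per-row detail
```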

RAGAS and Langsmith can indeed complement each other effectively when integrated properly. Rather than evaluating every interaction, consider sampling around 10-15% of conversations to gather sufficient insights while minimizing costs. Utilize Langsmith’s tracing feature to capture necessary context, and schedule batch evaluations during lower traffic times to avoid disrupting live users. Creating a bridge service that formats data from Langsmith’s API for RAGAS will streamline the process, allowing you to update your dashboard with feedback scores more efficiently. Additionally, I recommend using GPT-3.5-turbo for most of your metrics, as it balances accuracy and affordability well.
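If you want to pin the judge model as suggested, RAGAS lets you pass your own LLM into `evaluate()`; exact wrapper names vary a bit between RAGAS versions, and `sampled_dataset` below stands in for a `datasets.Dataset` built from your sampled traces, so treat this as a sketch:

```python
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import answer_relevancy, faithfulness

# Use a cheaper judge model for the bulk of the metrics.
judge_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-3.5-turbo", temperature=0))

result = evaluate(
    sampled_dataset,                      # ~10-15% of traces, reshaped into RAGAS columns
    metrics=[faithfulness, answer_relevancy],
    llm=judge_llm,
)
```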

Just went through this with my production RAG system. Here’s what worked: use Langsmith’s webhooks to trigger RAGAS evaluations asynchronously. Set up a separate evaluation service that retrieves trace data from those webhooks, then processes it through RAGAS metrics in the background. This setup has zero impact on live users since all operations occur outside your main application flow. I built a simple queue using Redis to manage the evaluation jobs. The challenging aspect was mapping Langsmith’s trace format to the requirements of RAGAS, particularly concerning context retrieval and answer relevance. It’s essential to push your evaluation results back to Langsmith via their SDK to maintain everything in one dashboard. I recommend starting with faithfulness and answer relevance metrics, as they are easier to implement.
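A rough sketch of the worker side of that setup: pop a job from a Redis list, score it with RAGAS, and write the scores back onto the original trace as LangSmith feedback. The queue name, job shape, and `build_ragas_dataset()` helper are assumptions about how you'd wire it up:

```python
import json

import redis
from langsmith import Client
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

r = redis.Redis()
client = Client()

while True:
    # Blocking pop: each job is a JSON blob queued by the webhook handler.
    _, raw = r.brpop("ragas_eval_queue")
    job = json.loads(raw)                  # e.g. {"run_id": "...", "question": "...", ...}

    dataset = build_ragas_dataset(job)     # hypothetical: reshape one trace into RAGAS columns
    result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])

    # Push each score back onto the run so it shows up in the LangSmith dashboard.
    scores = result.to_pandas().iloc[0]
    for metric_name in ("faithfulness", "answer_relevancy"):
        client.create_feedback(job["run_id"], key=metric_name, score=float(scores[metric_name]))
```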