Free alternatives to LangSmith's Prompt Canvas for testing prompts

Hello everyone! I’m looking for open-source alternatives that work like LangSmith’s Prompt Canvas feature. I’ve got several models running through Ollama on my local machine and I want to evaluate the same prompt across multiple models at once. Is there a free tool that lets me run batch testing with custom datasets? I need something that can run the same prompt on different models simultaneously so I can compare the responses side by side. Any recommendations would be really helpful!

I’ve been using PromptFoo for exactly this and it’s solid. It’s open source and handles batch evaluation across multiple models really well. You can set up custom datasets in CSV or JSON and run the same prompts against different models at once. The comparison view makes it easy to analyze responses side by side. Since you’re already running ollama locally, PromptFoo integrates nicely through API calls. Setup is straightforward - define your test cases, specify which models to test, and it runs everything in parallel. You get detailed comparisons and can add custom scoring metrics if needed. Really useful for A/B testing prompts across different model versions.

Try Open WebUI if you haven’t yet - it’s got comparison mode that works great with ollama models. Run the same prompt on multiple models at once and see results in a clean interface. Chainlit’s another solid choice for building custom evaluation dashboards. More setup involved but you get full control over how you compare outputs. For batch testing, I just write simple Python scripts using requests library to hit ollama endpoints. Works surprisingly well. Structure your test data however you want and export to whatever format works for analysis. Main advantage? You’re not stuck with any tool’s workflow and can customize evaluation criteria based on what actually matters for your case.
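If it helps, the "simple script" route really is short. Rough sketch, assuming Ollama on its default port (11434) and a model tag you've already pulled (llama3 here is just a placeholder):

    import requests

    # One prompt, one local model, non-streaming response.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": "Summarize retrieval-augmented generation in one sentence.", "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["response"])

Loop that over a list of prompts and models and you've basically got your batch runner.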

just use jupyter notebooks with asyncio. run concurrent requests to your ollama endpoints and dump the results into pandas dataframes - makes comparison super easy. way more flexible than any premade tool, and you can visualize with matplotlib or plotly however you want. takes maybe 20 minutes to set up a basic comparison script.
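rough sketch of that approach (httpx assumed installed, default ollama port, placeholder model tags):

    import asyncio
    import httpx
    import pandas as pd

    OLLAMA = "http://localhost:11434/api/generate"  # default local endpoint

    async def ask(client, model, prompt):
        # One non-streaming generate call, returned as a flat record.
        r = await client.post(OLLAMA, json={"model": model, "prompt": prompt, "stream": False}, timeout=300)
        return {"prompt": prompt, "model": model, "response": r.json()["response"]}

    async def run(prompts, models):
        async with httpx.AsyncClient() as client:
            return await asyncio.gather(*(ask(client, m, p) for p in prompts for m in models))

    # in a notebook just `await run(...)`; asyncio.run() is for plain scripts
    rows = asyncio.run(run(["Explain RAG in one sentence."], ["llama3", "mistral"]))
    df = pd.DataFrame(rows).pivot(index="prompt", columns="model", values="response")
    print(df)

fwiw a stock ollama server only runs a limited number of requests in parallel (OLLAMA_NUM_PARALLEL, if memory serves), so don't expect a linear speedup - the win is mostly the bookkeeping.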

Been doing this for years with different setups. Here’s what actually works:

Build a FastAPI service between your tests and ollama. Queue up hundreds of prompt variations, run them across models, get structured results back.
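Roughly what that layer looks like, minus the queueing. Bare-bones sketch, not my production setup; it assumes the default Ollama port and placeholder model tags:

    import requests
    from fastapi import FastAPI
    from pydantic import BaseModel

    OLLAMA = "http://localhost:11434/api/generate"  # default local endpoint
    app = FastAPI()

    class Batch(BaseModel):
        prompts: list[str]
        models: list[str]

    @app.post("/batch")
    def run_batch(batch: Batch):
        # Every prompt against every model, returned as structured rows.
        rows = []
        for prompt in batch.prompts:
            for model in batch.models:
                r = requests.post(OLLAMA, json={"model": model, "prompt": prompt, "stream": False}, timeout=300)
                rows.append({"prompt": prompt, "model": model, "response": r.json()["response"]})
        return rows

Run it with uvicorn and POST your dataset at /batch; swapping that inner loop for a proper job queue is the part that takes the weekend.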

I store everything in SQLite for quick queries. Way easier to spot patterns when you can filter by model, prompt type, or response scores.
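The schema doesn't need to be fancy. Sketch with stdlib sqlite3 (column names and the llama3 tag are just placeholders):

    import sqlite3

    conn = sqlite3.connect("results.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS results (
        model TEXT, prompt TEXT, response TEXT, score REAL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

    def save(model, prompt, response, score=None):
        conn.execute("INSERT INTO results (model, prompt, response, score) VALUES (?, ?, ?, ?)",
                     (model, prompt, response, score))
        conn.commit()

    # e.g. spot the longest responses from one model
    for prompt, n_chars in conn.execute(
            "SELECT prompt, length(response) FROM results WHERE model = ? ORDER BY 2 DESC LIMIT 5",
            ("llama3",)):
        print(n_chars, prompt)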

Real game changer is adding evaluation metrics upfront. Response length, keyword matching, sentiment scores - saves hours of manual review.
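For instance, a first-pass scorer can be this small (textblob is optional; swap in whatever actually matters for your prompts):

    from textblob import TextBlob  # optional, only used for the sentiment score

    def quick_score(response: str, keywords: list[str]) -> dict:
        # Cheap automatic checks that run before any human review.
        lowered = response.lower()
        return {
            "length_words": len(response.split()),
            "keyword_hits": sum(kw.lower() in lowered for kw in keywords),
            "sentiment": TextBlob(response).sentiment.polarity,  # -1.0 (negative) to 1.0 (positive)
        }

    print(quick_score("Paris is the capital of France.", ["Paris", "capital", "Eiffel"]))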

For UI, Gradio’s perfect for quick dashboards. Upload test datasets, trigger batch runs, browse results without building a full frontend.
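Something like this gets you a working dashboard in a couple dozen lines; I've swapped the dataset upload for a paste-in textbox to keep the sketch short (default ollama port and placeholder model tags assumed):

    import gradio as gr
    import pandas as pd
    import requests

    OLLAMA = "http://localhost:11434/api/generate"  # default local endpoint
    MODELS = ["llama3", "mistral"]                  # placeholder tags

    def run_batch(prompts_text):
        # One prompt per line; returns a row per prompt/model pair for the results table.
        rows = []
        for prompt in [p.strip() for p in prompts_text.splitlines() if p.strip()]:
            for model in MODELS:
                r = requests.post(OLLAMA, json={"model": model, "prompt": prompt, "stream": False}, timeout=300)
                rows.append({"prompt": prompt, "model": model, "response": r.json()["response"]})
        return pd.DataFrame(rows)

    gr.Interface(fn=run_batch,
                 inputs=gr.Textbox(lines=5, label="Prompts (one per line)"),
                 outputs=gr.Dataframe(label="Responses")).launch()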

Takes a weekend to set up, then you can test any models without thinking about it. Way better than wrestling with tools that almost work.

The Problem: You’re looking for an open-source solution to batch test prompts across multiple LLMs (like those running on Ollama) and compare their responses side-by-side, similar to LangSmith’s Prompt Canvas. You need a tool or method to handle simultaneous prompt testing and easily analyze the results.

:thinking: Understanding the “Why” (The Root Cause): Manually testing prompts across multiple models is time-consuming and inefficient. A streamlined automated solution allows for faster iteration, better prompt engineering, and a more objective comparison of different models’ performance. Building a custom solution offers greater flexibility and control than relying on pre-built tools that might not perfectly fit your workflow or scaling needs.

:gear: Step-by-Step Guide:

  1. Automate Your Prompt Testing Pipeline: The most efficient approach involves automating the entire process. This means creating a system that:

    • Takes a list of prompt variations as input (e.g., from a CSV or JSON file).
    • Sends each prompt to each specified LLM API endpoint (your Ollama models).
    • Collects the responses from each model.
    • Organizes and compares the responses in a structured format (e.g., a table or spreadsheet).

    This can be achieved using Python with libraries such as requests for making API calls and pandas for data manipulation. Here’s a basic conceptual outline:

    import requests
    import pandas as pd

    # A local Ollama server needs no API key; only the base URL matters.
    OLLAMA_URL = "http://localhost:11434/api/generate"

    prompts = ["Prompt 1", "Prompt 2", "Prompt 3"]
    models = ["model_A", "model_B", "model_C"]  # replace with the model tags you have pulled (see `ollama list`)

    results = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for model in models:
            payload = {"model": model, "prompt": prompt, "stream": False}
            response = requests.post(OLLAMA_URL, json=payload, timeout=300)
            response.raise_for_status()
            row[model] = response.json()["response"]  # non-streaming /api/generate returns the text under "response"
        results.append(row)

    df = pd.DataFrame(results)
    print(df)
    df.to_csv("results.csv", index=False)  # save results to CSV for further analysis
    
  2. Consider a Task Scheduler: To run tests regularly (e.g., daily or weekly), use a task scheduler like cron (Linux/macOS) or Task Scheduler (Windows) to automatically execute your Python script.

  3. Enhance with Evaluation Metrics: Add automated evaluation metrics to your script. This could include:

    • Response length.
    • Keyword matching (check for the presence of specific keywords in the response).
    • Sentiment analysis (using a library like textblob).
    • BLEU or similar overlap scores (only if you have reference answers to compare responses against).
  4. Notification System (Optional): Integrate a notification system (e.g., via email or Slack) to alert your team when tests are complete.

  5. Data Storage and Visualization (Optional): Store results in a database (like SQLite, as suggested in another answer) for easy querying and long-term tracking. Use visualization tools (like Matplotlib or Plotly) to create charts and graphs to easily analyze trends in model performance over time.
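    For the visualization part, here is a minimal sketch that assumes the wide results.csv written by the script in step 1 (one column per model) and uses mean response length as a quick example metric:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("results.csv")                       # produced by the step 1 script
    model_cols = [c for c in df.columns if c != "prompt"]

    # Mean response length (in words) per model, a crude but quick first signal.
    lengths = df[model_cols].apply(lambda col: col.fillna("").str.split().str.len().mean())
    ax = lengths.plot(kind="bar")
    ax.set_ylabel("mean response length (words)")
    plt.tight_layout()
    plt.savefig("length_by_model.png")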

:mag: Common Pitfalls & What to Check Next:

  • Concurrency Limits: A local Ollama server doesn’t impose API rate limits the way hosted providers do, but it can only serve a limited number of requests at once and queues the rest, so large batches can be slow. Set generous timeouts and throttle how many requests you fire in parallel.
  • Error Handling: Implement robust error handling in your script to gracefully handle potential issues like network errors, API errors, and unexpected response formats.
  • Authentication: A stock local Ollama install doesn’t use API keys, so there is nothing to configure there. If you’ve put Ollama behind a reverse proxy, or are also testing hosted endpoints that do require credentials, double-check those separately.
  • Data Format Consistency: Ensure your prompt variations and the expected response formats are consistent throughout your testing process.

:speech_balloon: Still running into issues? Share your (sanitized) config files, the exact command you ran, and any other relevant details. The community is here to help!
