Returning image data from tools in langchain/langgraph workflow

I’m working with a language model that can handle image inputs. I need to create a tool that captures a webpage screenshot and returns it to the model for processing.

Here’s my current setup:

from langchain_core.tools import tool

def capture_webpage_screenshot():
    # This function captures a screenshot and saves it locally
    return "/home/user/screenshots/webpage.png"

@tool
def webpage_screenshot_tool():
    image_file_path = capture_webpage_screenshot()
    return ???  # Not sure what to return here

I’m confused about what the tool should return. Just returning the file path as a string doesn’t seem helpful for the model. What’s the proper way to return image data from a langchain tool? Are there specific data types that work better than others when the tool decorator is used?

Had the same problem building a web automation tool last month. Base64 works but gets messy with bigger images. I’ve had good luck building langchain’s HumanMessage with image content directly in the workflow. Don’t build the message inside the tool - return the encoded image plus some metadata, then create the properly formatted message in a workflow node:

from langchain_core.messages import HumanMessage
from langchain_core.tools import tool
import base64

@tool
def webpage_screenshot_tool():
    """Capture a webpage screenshot and return its base64-encoded image data."""
    image_path = capture_webpage_screenshot()
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    
    # Return structured content for the workflow
    return {
        "image_b64": image_data,
        "mime_type": "image/png",
        "file_path": image_path
    }

Then build your HumanMessage in the workflow node with the image content. You keep tool execution separate from message formatting. The workflow converts tool output into the format vision models want. Way cleaner than stuffing everything into the tool return.
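Here’s a minimal sketch of that workflow step. The helper name and the plain-dict content list are illustrative (the same list is what you’d pass as HumanMessage(content=...)); only the tool-output keys come from the snippet above:

```python
import base64

def to_image_message_content(tool_output: dict) -> list:
    # Hypothetical helper: turn the tool's output dict into the
    # multimodal content-block list vision chat models accept.
    return [
        {"type": "text", "text": "Here is the captured screenshot:"},
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:{tool_output['mime_type']};base64,{tool_output['image_b64']}"
            },
        },
    ]

# Tiny fake payload standing in for a real screenshot
fake_output = {
    "image_b64": base64.b64encode(b"\x89PNG fake bytes").decode(),
    "mime_type": "image/png",
    "file_path": "/home/user/screenshots/webpage.png",
}
content = to_image_message_content(fake_output)
```

In the node you’d then append HumanMessage(content=content) to the message list before calling the model.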

You can’t just return the file path - the model needs the actual image data. I’ve hit this same problem in multiple projects.

The fix is encoding the image as base64 and returning it properly formatted:

import base64
from langchain_core.tools import tool

def capture_webpage_screenshot():
    # Your screenshot logic here
    return "/home/user/screenshots/webpage.png"

@tool
def webpage_screenshot_tool():
    """Capture a webpage screenshot and return it as a base64 data URI."""
    image_file_path = capture_webpage_screenshot()
    
    with open(image_file_path, "rb") as image_file:
        encoded_image = base64.b64encode(image_file.read()).decode('utf-8')
    
    return f"data:image/png;base64,{encoded_image}"

This gives you a data URI that vision models can actually use. Works great with GPT-4V and Claude.
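For reference, a consumer on the other side can split that data URI back into a mime type and raw bytes. decode_data_uri here is just an illustrative helper, not part of langchain:

```python
import base64

def decode_data_uri(uri: str) -> tuple[str, bytes]:
    """Split a 'data:<mime>;base64,<payload>' URI into (mime type, raw bytes)."""
    header, payload = uri.split(",", 1)
    mime = header[len("data:"):-len(";base64")]  # strip the fixed prefix/suffix
    return mime, base64.b64decode(payload)

# Round-trip a small payload to show the format
uri = "data:image/png;base64," + base64.b64encode(b"\x89PNG test").decode()
mime, raw = decode_data_uri(uri)
```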

If you’re using LangGraph, try returning a dictionary instead:

return {
    "type": "image",
    "data": encoded_image,
    "format": "png",
    "description": "Screenshot of webpage"
}

This gives you better control over image processing in your workflow.
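As a sketch of what that looks like in a workflow node (the node name and state keys are made up here; the content-block layout follows the common vision-model format):

```python
def process_tool_image(state: dict) -> dict:
    # Hypothetical LangGraph-style node: converts the tool's image dict
    # into a multimodal user message appended to state["messages"].
    tool_result = state["tool_result"]
    if tool_result.get("type") == "image":
        message = {
            "role": "user",
            "content": [
                {"type": "text", "text": tool_result.get("description", "")},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/{tool_result['format']};base64,{tool_result['data']}"
                    },
                },
            ],
        }
        return {"messages": state.get("messages", []) + [message]}
    return state

# Demo with a stand-in payload ("QUJD" is base64 for b"ABC")
demo_state = {
    "messages": [],
    "tool_result": {
        "type": "image",
        "data": "QUJD",
        "format": "png",
        "description": "Screenshot of webpage",
    },
}
updated = process_tool_image(demo_state)
```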