I’m working with a language model that can handle image inputs. I need to create a tool that captures a webpage screenshot and returns it to the model for processing.
Here’s my current setup:
from langchain_core.tools import tool

def capture_webpage_screenshot():
    # This function captures a screenshot and saves it locally
    return "/home/user/screenshots/webpage.png"
@tool
def webpage_screenshot_tool():
    """Capture a screenshot of a webpage."""  # @tool requires a docstring
    image_file_path = capture_webpage_screenshot()
    return ??? # Not sure what to return here
I’m confused about what the tool should return. Just returning the file path as a string doesn’t seem helpful for the model. What’s the proper way to return image data from a langchain tool? Are there specific data types that work better than others when the tool decorator is used?
Had the same problem building a web automation tool last month. Base64 works but gets messy with bigger images. I’ve had good luck using langchain’s HumanMessage with image content directly. Don’t build the message inside the tool - have the tool return structured data and create the HumanMessage later in your workflow:
import base64
from langchain_core.tools import tool

@tool
def webpage_screenshot_tool():
    """Capture a webpage screenshot and return its image data."""
    image_path = capture_webpage_screenshot()
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    # Return structured content for the workflow
    return {
        "image_b64": image_data,
        "mime_type": "image/png",
        "file_path": image_path,
    }
Then build your HumanMessage in the workflow node with the image content. You keep tool execution separate from message formatting. The workflow converts tool output into the format vision models want. Way cleaner than stuffing everything into the tool return.
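For the workflow side, here's a rough sketch of turning that tool output into the multimodal content blocks a HumanMessage accepts (the helper name and the fake payload are mine, and I'm assuming the OpenAI-style data-URL format that langchain's vision-capable models take):

```python
import base64

def screenshot_to_message_content(tool_output: dict, prompt: str) -> list:
    # Build the content blocks for HumanMessage(content=...) from the tool's dict
    data_url = f"data:{tool_output['mime_type']};base64,{tool_output['image_b64']}"
    return [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": data_url}},
    ]

# Example with a dummy payload standing in for real PNG bytes
fake_output = {
    "image_b64": base64.b64encode(b"\x89PNG fake bytes").decode(),
    "mime_type": "image/png",
    "file_path": "/home/user/screenshots/webpage.png",
}
content = screenshot_to_message_content(fake_output, "Describe this webpage.")
# In the workflow node: HumanMessage(content=content)
```

Then invoke your vision model with that message and the image travels with the prompt.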