I’m searching for an open-source solution that can automate web browsing tasks directly within the browser itself, not through external automation tools.
What I need is a tool that can:
Work as a browser extension or embedded script
Connect to AI services like OpenAI or Claude
Perform basic web interactions such as:
navigateTo(link)
scrollPage(distance)
clickElement(cssSelector)
fillForm(selector, content)
extractData(selector)
executeScript(jsCode)
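For illustration, the action list above could be modeled as a typed vocabulary, so that whatever the AI sends back is validated before anything touches the page. This is only a sketch: the JSON shape (a "type" field plus the parameter names from the list) is an assumption, and parseAction is a hypothetical helper, not part of any existing tool.

```typescript
// Hypothetical action vocabulary mirroring the list above.
type Action =
  | { type: "navigateTo"; link: string }
  | { type: "scrollPage"; distance: number }
  | { type: "clickElement"; cssSelector: string }
  | { type: "fillForm"; selector: string; content: string }
  | { type: "extractData"; selector: string }
  | { type: "executeScript"; jsCode: string };

// Parse one AI reply (assumed to be a JSON object) into an Action.
// Returns null for malformed JSON, unknown action types, or wrong field types.
function parseAction(raw: string): Action | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof data !== "object" || data === null) return null;
  const d = data as Record<string, unknown>;

  // Read a field only if it is a string.
  const str = (key: string): string | null => {
    const v = d[key];
    return typeof v === "string" ? v : null;
  };

  switch (d["type"]) {
    case "navigateTo": {
      const link = str("link");
      return link === null ? null : { type: "navigateTo", link };
    }
    case "scrollPage": {
      const distance = d["distance"];
      return typeof distance === "number" ? { type: "scrollPage", distance } : null;
    }
    case "clickElement": {
      const cssSelector = str("cssSelector");
      return cssSelector === null ? null : { type: "clickElement", cssSelector };
    }
    case "fillForm": {
      const selector = str("selector");
      const content = str("content");
      return selector === null || content === null
        ? null
        : { type: "fillForm", selector, content };
    }
    case "extractData": {
      const selector = str("selector");
      return selector === null ? null : { type: "extractData", selector };
    }
    case "executeScript": {
      const jsCode = str("jsCode");
      return jsCode === null ? null : { type: "executeScript", jsCode };
    }
    default:
      return null; // unknown or missing action type
  }
}
```

Rejecting anything outside the six known actions keeps a confused model reply from turning into an arbitrary operation on the page.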
The idea is to give it instructions like “Visit this shopping site and look for wireless headphones” and let the AI analyze the page structure and decide what actions to take next.
Most solutions I’ve found run server-side and drive headless browser instances. But I want something that operates entirely within the active browser tab, so I can interrupt the process and take manual control when needed.
Selenium could work if you run it in your regular browser instead of going headless (as far as I know, the extension-based part of the project is Selenium IDE rather than WebDriver itself). I tested it last month. You’ll still need to handle the AI connection separately though. I used a local WebSocket server to connect the extension with Claude’s API. Works pretty well, but page refreshes sometimes kill the connection.
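On the page-refresh problem: if the socket lives in a content script, a refresh tears the script down and the socket goes with it, so the freshly loaded script has to reconnect. Here is a rough sketch of reconnecting with exponential backoff. Everything in it is an assumption for illustration: SocketLike, keepConnected and backoffDelay are hypothetical names, and the socket factory and timer are passed in so the logic stays independent of any particular WebSocket setup.

```typescript
// Minimal shape of the connection this sketch needs (hypothetical).
interface SocketLike {
  onopen: (() => void) | null;
  onclose: (() => void) | null;
}

interface ReconnectOptions {
  // Creates a fresh connection, for example by wrapping `new WebSocket(url)`.
  open: () => SocketLike;
  // Schedules a callback, for example (fn, ms) => setTimeout(fn, ms).
  schedule: (fn: () => void, delayMs: number) => void;
  baseMs?: number; // first retry delay
  maxMs?: number;  // upper bound on the delay
}

// Exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelay(attempt: number, baseMs = 500, maxMs = 10_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Opens a connection and re-opens it whenever it closes.
function keepConnected(opts: ReconnectOptions): void {
  let attempt = 0;
  const connect = (): void => {
    const socket = opts.open();
    socket.onopen = () => {
      attempt = 0; // a successful connection resets the backoff
    };
    socket.onclose = () => {
      const delay = backoffDelay(attempt, opts.baseMs, opts.maxMs);
      attempt += 1;
      opts.schedule(connect, delay);
    };
  };
  connect();
}
```

This only restores the transport. Whatever the AI was in the middle of doing still has to be stored somewhere that survives the refresh, such as the local server or the extension’s background script.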
Puppeteer-in-browser is worth a look, but it’s still experimental. The biggest pain is security restrictions: content scripts are bound by the page’s CORS rules, so they can’t just call AI services directly. Routing requests through the extension’s background script with the right host permissions is one option; I got around it with a lightweight proxy server between the extension and OpenAI’s API. For interruptions, I used a simple state machine where each step checks for user input before moving on. The extension watches for keyboard shortcuts or button clicks to pause everything. What really got me was how differently sites handle dynamic content. You need solid waiting mechanisms before the AI analyzes page structure, or it’ll miss elements that load after the initial render. A DOM observer pattern (e.g. MutationObserver) works great here. Performance tanks when running complex AI analysis on heavy pages, so throttle your requests or cache repeated actions locally.
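To make the interruption idea concrete, here is a minimal sketch of that kind of state machine, written as a pure transition function so it is easy to reason about. It is not the code I actually ran; the state and event names (RunnerState, RunnerEvent, transition, canRunNext) are made up for this example. The automation loop would call canRunNext before each step, and the keyboard shortcut or pause button would feed in the "pause" and "resume" events.

```typescript
type Status = "running" | "paused" | "done";

interface RunnerState {
  status: Status;
  nextStep: number; // index of the next step to execute
}

type RunnerEvent =
  | { kind: "pause" }        // user pressed the shortcut or pause button
  | { kind: "resume" }       // user handed control back
  | { kind: "stepFinished" }; // the current automation step completed

// Pure transition function: returns the new state, never mutates the old one.
function transition(
  state: RunnerState,
  event: RunnerEvent,
  totalSteps: number,
): RunnerState {
  switch (event.kind) {
    case "pause":
      return state.status === "running" ? { ...state, status: "paused" } : state;
    case "resume":
      return state.status === "paused" ? { ...state, status: "running" } : state;
    case "stepFinished": {
      if (state.status === "done") return state;
      const nextStep = state.nextStep + 1;
      // A pause requested mid-step is kept unless that was the last step.
      return nextStep >= totalSteps
        ? { status: "done", nextStep }
        : { ...state, nextStep };
    }
  }
}

// The automation loop checks this before starting each step.
function canRunNext(state: RunnerState): boolean {
  return state.status === "running";
}
```

Because a pause only takes effect between steps, the user never takes over halfway through a click or form fill.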
I built something like this about six months ago. You’re right - most automation frameworks handle interactive browsing sessions poorly since they’re built for headless environments. I solved it with a browser extension using the WebExtensions API plus content scripts. The extension connects to AI services through background scripts while content scripts handle DOM manipulation in each tab. You can pause automation anytime and take manual control without breaking anything. The tricky part is managing communication between the AI service, the extension background script, and the content script in the active tab. You need solid state management so the AI keeps context when users navigate between pages or interrupt the flow. For AI integration, I broke page analysis into steps instead of cramming everything into one API call: extract the page structure first, let the AI plan actions, then execute step-by-step with feedback loops. I haven’t found any good open-source projects that nail this use case. They’re either too basic or need heavy server infrastructure.
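The extract, plan, execute cycle can be sketched roughly like this. It is a simplified illustration rather than my actual extension code: PageSnapshot, PlannedStep, StepResult, AgentDeps and runAgentLoop are hypothetical names, and the three callbacks stand in for the content script (snapshot, execute) and the background script’s AI call (plan).

```typescript
// All types and names below are hypothetical, for illustration only.
interface PageSnapshot {
  url: string;
  elements: string[]; // compact description of interactive elements
}

interface PlannedStep {
  action: string; // e.g. "clickElement"
  target: string; // e.g. a CSS selector
}

interface StepResult {
  step: PlannedStep;
  ok: boolean;
  detail: string; // what happened, fed back to the planner
}

interface AgentDeps {
  // Content script: summarize the current page.
  snapshot: () => Promise<PageSnapshot>;
  // Background script: ask the AI for the next step, or null when finished.
  plan: (
    goal: string,
    page: PageSnapshot,
    history: ReadonlyArray<StepResult>,
  ) => Promise<PlannedStep | null>;
  // Content script: perform one step in the tab.
  execute: (step: PlannedStep) => Promise<StepResult>;
}

// One small AI call per step instead of one huge call for the whole task.
async function runAgentLoop(
  goal: string,
  deps: AgentDeps,
  maxSteps = 10,
): Promise<StepResult[]> {
  const history: StepResult[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const page = await deps.snapshot();                // 1. extract page structure
    const step = await deps.plan(goal, page, history); // 2. let the AI plan one action
    if (step === null) break;                          // planner reports the goal is met
    history.push(await deps.execute(step));            // 3. execute and record feedback
  }
  return history;
}
```

Taking a fresh snapshot on every iteration means the planner always sees the page as it is now, including after navigation or manual changes by the user, and the maxSteps cap keeps a confused planner from looping forever.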