I need help with capturing an entire web page using Python and Selenium in headless mode. My goal is to download the complete webpage exactly as it displays in the browser, similar to the “Save page as” option in Chrome or Firefox.
I found some working code that uses a visible browser window with automation hotkeys, but I want to modify it for headless operation. Here’s a different approach I’m trying:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
chrome_options = Options()
chrome_options.add_argument("--headless")
browser = webdriver.Chrome(options=chrome_options)
browser.get("https://www.example.com")
time.sleep(3)
# This is where I'm stuck - how to save the complete page?
page_source = browser.page_source
with open("saved_page.html", "w", encoding="utf-8") as file:
    file.write(page_source)
browser.quit()
The problem is this only saves the HTML source code, not the complete page with images and styling. Is there a way to save everything in headless mode without using keyboard shortcuts?
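Yeah, headless Chrome has no direct equivalent of the UI’s “Save as”. You could run wget against the URL after checking the page loads (its --page-requisites option pulls in images and CSS, though it won’t see anything rendered by JavaScript), or look at the Chrome DevTools Protocol for better options. Some people also use requests-html for this kind of task.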
I’ve dealt with this exact issue before and found that combining Selenium with additional HTTP requests works well. After loading the page with headless Chrome, you can extract all resource URLs from the page source and download them separately using the requests library. Create a local directory structure that mirrors the webpage’s assets, then update the HTML references to point to your local files. It requires parsing CSS files for additional resources like fonts and background images, but gives you complete control over what gets saved. The downside is it’s more complex than a simple page save, but it’s reliable for headless automation.
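A rough sketch of that idea, for illustration only. It uses the standard library (html.parser and urllib) in place of requests so it has no extra dependencies, and it skips the CSS-parsing step entirely. The names collect_resources, local_name, rewrite_references and save_page, and the flat "assets" folder layout, are my own assumptions and not part of any library.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class ResourceCollector(HTMLParser):
    """Collect (raw attribute value, absolute URL) pairs for page assets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower()
        if tag in ("img", "script"):
            raw = attrs.get("src")
        elif tag == "link" and "stylesheet" in rel:
            raw = attrs.get("href")
        else:
            raw = None
        if raw:
            self.found.append((raw, urljoin(self.base_url, raw)))


def collect_resources(html, base_url):
    parser = ResourceCollector(base_url)
    parser.feed(html)
    return parser.found


def local_name(url, asset_dir="assets"):
    """Hypothetical naming scheme: keep only the file name of the URL path."""
    return f"{asset_dir}/{Path(urlparse(url).path).name or 'resource'}"


def rewrite_references(html, resources):
    """Point each collected reference at its local copy (naive string replace)."""
    for raw, absolute in resources:
        html = html.replace(f'"{raw}"', f'"{local_name(absolute)}"')
    return html


def save_page(html, base_url, out_dir="saved_page"):
    """Download every collected asset and write the rewritten HTML next to it."""
    out = Path(out_dir)
    (out / "assets").mkdir(parents=True, exist_ok=True)
    resources = collect_resources(html, base_url)
    for _, absolute in resources:
        with urlopen(absolute, timeout=30) as response:
            (out / local_name(absolute)).write_bytes(response.read())
    (out / "index.html").write_text(rewrite_references(html, resources), encoding="utf-8")
```

With the Selenium script from the question, you would call something like save_page(browser.page_source, browser.current_url) before browser.quit(). Two known gaps in this sketch: assets that share a file name overwrite each other, and the string replace misses references whose attribute values contain HTML entities or are not double-quoted.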
The Chrome DevTools Protocol offers a more direct solution for this. You can use the Page.captureSnapshot command, which produces an MHTML archive of the page with its resources embedded. Here’s how I implemented it using the pychrome library:
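import pychrome
# Assumes Chrome is already running with --remote-debugging-port=9222
browser = pychrome.Browser(url="http://127.0.0.1:9222")
tab = browser.new_tab()
tab.start()
tab.Page.enable()
tab.Page.navigate(url="https://www.example.com")
tab.wait(5)  # crude fixed wait for the page to finish loading
result = tab.Page.captureSnapshot(format="mhtml")
# 'data' is the MHTML document as a plain string, not base64
with open("complete_page.mhtml", "w", encoding="utf-8", newline="") as f:
    f.write(result['data'])
tab.stop()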
This approach captures the rendered page together with its CSS and images in a single file, much like the browser’s save function. MHTML preserves the page structure and opens directly in Chromium-based browsers such as Chrome and Edge; other browsers may need an extension or converter. You’ll need to launch Chrome with remote debugging enabled first (for example with --remote-debugging-port=9222, matching the URL in the code), but it’s much cleaner than manually downloading individual resources.
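If you would rather stay inside Selenium, the same CDP command can be sent through the Chrome driver's execute_cdp_cmd method, which avoids starting Chrome with a debugging port yourself. The following is a minimal sketch under that assumption; save_mhtml and capture_example are hypothetical helper names, and the fixed URL and output file name are taken from the examples in this thread.

```python
from pathlib import Path


def save_mhtml(driver, path):
    """Ask Chrome for an MHTML snapshot via CDP and write it to `path`."""
    result = driver.execute_cdp_cmd("Page.captureSnapshot", {"format": "mhtml"})
    # The snapshot arrives as a plain string; newline="" keeps its line
    # endings exactly as Chrome produced them.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(result["data"])
    return Path(path)


def capture_example():
    """Load a page in headless Chrome and save it as a single MHTML file."""
    # Imported here so save_mhtml can be reused without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://www.example.com")
        save_mhtml(driver, "complete_page.mhtml")
    finally:
        driver.quit()
```

Calling capture_example() runs the whole flow. Pages that load content lazily may still need an explicit wait before the snapshot is taken.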