How to Save a Full Web Page Using a Headless Browser with Selenium and Python

I’m trying to find a solution for saving an entire webpage using Selenium with Python while employing a headless browser. My goal is to ensure that the saved webpage appears exactly the same as it does when accessed in a regular browser, akin to the ‘Save as…’ functionality.

I previously used a code example from Andersson that worked well, but I need to adapt it for a headless setup. Is there a way to achieve this? Here’s an example of code I used:

from selenium.webdriver.chrome.service import Service
from selenium import webdriver

service = Service('path/to/chromedriver')
options = webdriver.ChromeOptions()
options.add_argument('--headless')

browser = webdriver.Chrome(service=service, options=options)
browser.get('http://www.example.com')
# Add code here to save the webpage

Any suggestions would be greatly appreciated!

To save a full webpage using Selenium with Python in a headless browser, you can use the driver's execute_cdp_cmd method to send Chrome DevTools Protocol (CDP) commands. One of these commands, Page.captureSnapshot, saves the page as an MHTML file: a single archive that bundles the HTML together with resources such as images and stylesheets, so the saved copy keeps its formatting.

Here’s how you can adjust your code using this approach:

from selenium.webdriver.chrome.service import Service
from selenium import webdriver

service = Service('path/to/chromedriver')
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')  # Historically needed for headless Chrome on Windows
options.add_argument('--no-sandbox')  # Often needed when Chrome runs as root, e.g. in a container

browser = webdriver.Chrome(service=service, options=options)
browser.get('http://www.example.com')

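# Use the CDP command to capture the loaded page as MHTML
mhtml = browser.execute_cdp_cmd('Page.captureSnapshot', {'format': 'mhtml'})

# newline='' keeps the snapshot's CRLF line endings intact;
# without it, Windows would turn each '\n' into '\r\n' and corrupt the file
with open('webpage.mhtml', 'w', encoding='utf-8', newline='') as f:
    f.write(mhtml['data'])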

browser.quit()

In this example:

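  • The Page.captureSnapshot command from the Chrome DevTools Protocol captures the loaded page, including its resources, as an MHTML string. The protocol documentation marks this command as experimental, so its behavior may change between Chrome versions.
  • The execute_cdp_cmd method, available on Selenium's Chromium-based drivers, lets you send DevTools commands directly from Selenium.
  • The MHTML file opens in Chromium-based browsers such as Chrome and Edge, though other browsers may not support the format. It keeps the page's layout, images, and styles in one file, similar to the 'Save as...' feature in browsers.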
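
If you want to reuse the capture step and check what actually ended up in the archive, a small pair of helpers may be useful. This is only a sketch: save_mhtml and list_mhtml_parts are hypothetical names, not part of Selenium, and the inspection step assumes the snapshot is a standard MIME multipart message, which Python's built-in email module can parse.

```python
from email import message_from_bytes
from pathlib import Path


def save_mhtml(driver, path):
    """Capture the driver's current page as MHTML and write it to `path`.

    `driver` is assumed to be a Chromium-based Selenium driver, i.e. any
    object that provides execute_cdp_cmd.
    """
    snapshot = driver.execute_cdp_cmd("Page.captureSnapshot", {"format": "mhtml"})
    # newline="" prevents Python from translating the CRLF line endings
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(snapshot["data"])
    return Path(path)


def list_mhtml_parts(path):
    """Return (content_type, content_location) for each resource in the archive."""
    message = message_from_bytes(Path(path).read_bytes())
    return [
        (part.get_content_type(), part.get("Content-Location"))
        for part in message.walk()
        if not part.is_multipart()  # skip the multipart container itself
    ]
```

After browser.get(...), you could call save_mhtml(browser, 'webpage.mhtml') and then print list_mhtml_parts('webpage.mhtml') to see whether stylesheets and images were bundled along with the HTML.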