How can I prepend the absolute URL to all relative links in Puppeteer when generating a PDF?

FlyingEagle · January 5, 2025, 1:39am

I have a local HTML file of a single webpage, saved to my computer without any alterations. When I generate a PDF from it with Puppeteer, images with relative paths are missing. The relative href attributes in the resulting PDF also point to a non-existent local address instead of the original URL, which should be http://www.example.com/ plus the relative path. Is there a way in Puppeteer to set a base URL so that http://www.example.com/ is automatically prepended to every relative path that begins with / in my HTML document, including images, stylesheets, and scripts?

Finn_Mystery · January 12, 2025, 11:06pm

Hi FlyingEagle,

To prepend an absolute URL to all relative links when generating a PDF with Puppeteer, you can load the file and then rewrite the href and src attributes in the page's DOM before converting it to a PDF. Here's a practical way to do it:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Load your local HTML file
  await page.goto('file:///absolute/path/to/your/localfile.html');

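  // Rewrite root-relative href/src values (e.g. "/img/a.png") to absolute URLs
  await page.evaluate(() => {
    const base = 'http://www.example.com';
    document.querySelectorAll('[href], [src]').forEach(el => {
      // Check both attributes, since an element may carry either one
      for (const attr of ['href', 'src']) {
        const value = el.getAttribute(attr);
        // Skip missing attributes and protocol-relative URLs ("//cdn...")
        if (value && value.startsWith('/') && !value.startsWith('//')) {
          el.setAttribute(attr, base + value);
        }
      }
    });
  });

  // Let the rewritten images and stylesheets finish loading before printing
  await page.waitForNetworkIdle();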

  // Generate PDF
  await page.pdf({ path: 'output.pdf', format: 'A4' });

  await browser.close();
})();

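This snippet rewrites every href and src value that starts with '/' by prepending http://www.example.com, so links and resources point to the original site. Because the attributes are changed after the page has already loaded, images and stylesheets are fetched again from the new URLs, so make sure they have finished loading before you call page.pdf(). Scripts that failed to load on the first pass are not re-run when their src changes.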
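
As a possible refinement, the string check can be replaced with the standard URL constructor, which resolves root-relative, protocol-relative, and already-absolute values in one step. The sketch below is an illustration rather than part of the answer above: resolveAgainstBase is a hypothetical helper name, and skipping fragments and non-HTTP schemes is an assumption about what should be left untouched.

```javascript
// Hypothetical helper: resolve an href/src value against the original site's URL.
// Returns null for values that should be left as they are.
function resolveAgainstBase(value, baseUrl) {
  const trimmed = value.trim();
  // Assumption: empty values and in-page anchors ("#top") stay untouched.
  if (trimmed === '' || trimmed.startsWith('#')) return null;

  let resolved;
  try {
    // The URL constructor handles "/path", "//host/path", and absolute URLs.
    resolved = new URL(trimmed, baseUrl);
  } catch {
    return null; // value could not be parsed as a URL
  }

  // Assumption: only web URLs are rewritten; mailto:, data:, etc. are skipped.
  if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') return null;
  return resolved.href;
}

console.log(resolveAgainstBase('/img/logo.png', 'http://www.example.com/'));
// http://www.example.com/img/logo.png
```

Since URL is also available in the browser, the same function could be declared inside the page.evaluate callback and applied to each href and src value. Document-relative values such as css/site.css are resolved against whatever base is passed in, so the base should be the original page's full URL if the page did not live at the site root.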

Hope this solution helps you efficiently resolve the issue!

AdventurousHiker17 · January 11, 2025, 10:30pm

You can handle relative URLs in Puppeteer by manipulating HTML content before PDF generation. Here's a quick approach to prepend the base URL to all relative links:

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

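  // Load the HTML file and rewrite root-relative href/src values
  let html = fs.readFileSync('path/to/your/localfile.html', 'utf8');
  const baseUrl = 'http://www.example.com';
  // Match either quote style, but skip protocol-relative URLs ("//cdn...")
  html = html.replace(/(href|src)=(["'])\/(?!\/)/g, `$1=$2${baseUrl}/`);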

  // Set modified content
  await page.setContent(html, { waitUntil: 'networkidle0' });

  // Generate PDF
  await page.pdf({ path: 'output.pdf', format: 'A4' });

  await browser.close();
})();

This snippet rewrites every href and src value that starts with "/" into a complete URL before the page is loaded, so images, stylesheets, and scripts are requested from http://www.example.com when the PDF is rendered.
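
To make the rewriting step easier to check in isolation, it could be pulled into a small pure function. The following is a minimal sketch under assumptions: prefixRootRelative is a hypothetical name, and the pattern only covers quoted href and src attributes (not srcset or url() references inside CSS).

```javascript
// Hypothetical helper: prefix root-relative href/src values in an HTML string.
function prefixRootRelative(html, baseUrl) {
  // Drop trailing slashes so the result never contains "com//path".
  const origin = baseUrl.replace(/\/+$/, '');
  // Whitespace, then href or src, "=", a quote, and a single "/".
  // The (?!\/) lookahead skips protocol-relative URLs such as "//cdn...".
  const pattern = /(\s)(href|src)(\s*=\s*)(["'])\/(?!\/)/gi;
  // A replacer function is used so "$" in baseUrl is never treated specially.
  return html.replace(
    pattern,
    (_match, ws, attr, eq, quote) => `${ws}${attr}${eq}${quote}${origin}/`
  );
}
```

The returned string can then be passed to page.setContent as in the snippet above. Because this is plain text substitution, it will also rewrite matching text that merely looks like an attribute, for example inside a code sample on the page.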

DancingFox · January 13, 2025, 4:36pm

An alternative to patching the DOM after the page has loaded is to rewrite the relative URLs in the HTML string first and then hand the adjusted HTML to page.setContent.

Here's an example:

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Read your HTML file
  let html = fs.readFileSync('path/to/your/localfile.html', 'utf8');

  // Modify root-relative paths directly in the HTML string
  const baseUrl = 'http://www.example.com';
  // Match either quote style, but skip protocol-relative URLs ("//cdn...")
  html = html.replace(/(href|src)=(["'])\/(?!\/)/g, `$1=$2${baseUrl}/`);

  // Set the modified HTML content
  await page.setContent(html, { waitUntil: 'networkidle0' });

  // Generate PDF
  await page.pdf({ path: 'output.pdf', format: 'A4' });

  await browser.close();
})();

In this approach, the code reads the original HTML file and uses a regular expression to search for all href and src attributes beginning with "/", replacing them with the full path including the base URL. This way, you ensure all relative paths are correctly transformed in one go before Puppeteer processes the content.

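This might be a more suitable method if you're dealing with a large document: the URLs are already absolute when the page is first parsed, so images, stylesheets, and scripts load on the first pass, and the replacement does not depend on per-element attribute checks like the page.evaluate solution.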
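
Another option, closer to the "set a base URL" wording of the question, is the standard HTML base element: with a base href in the head, the browser itself resolves every relative URL against it, so no attributes need to be rewritten. The sketch below assumes the HTML is loaded via page.setContent as in the posts above; injectBaseTag is a hypothetical helper name.

```javascript
// Hypothetical helper: insert <base href="..."> so the browser resolves
// relative URLs against the original site.
function injectBaseTag(html, baseUrl) {
  const baseTag = `<base href="${baseUrl}">`;

  // Assumption: a document that already declares a base is left untouched.
  if (/<base\s/i.test(html)) return html;

  // Prefer placing the tag right after the opening <head ...> tag
  // (the pattern does not match <header>).
  const headPattern = /<head(\s[^>]*)?>/i;
  if (headPattern.test(html)) {
    return html.replace(headPattern, (headTag) => `${headTag}${baseTag}`);
  }

  // No explicit <head>: insert after the doctype (if any) so it stays first.
  const doctype = html.match(/^\s*<!doctype[^>]*>/i);
  const cut = doctype ? doctype[0].length : 0;
  return html.slice(0, cut) + baseTag + html.slice(cut);
}
```

The result would be passed to page.setContent(html, { waitUntil: 'networkidle0' }) before calling page.pdf. Two things to keep in mind: document-relative paths and fragment-only links such as #section are also resolved against the base, so the base should be the original page's full URL rather than just the site root if the page lived in a subdirectory.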