How can I extract HTML table data as JSON format during Puppeteer execution?

I’m having trouble converting HTML tables to JSON while using Puppeteer together with the tabletojson library.

My process is straightforward: I use Puppeteer to open Chrome, navigate to a page, enter search terms, and click the search button. After Puppeteer completes these actions, a table gets displayed on the page. However, tabletojson keeps returning an empty array.

The issue seems to be that tabletojson runs independently of Puppeteer rather than inside its session: it fetches the same URL with a fresh request, so without the previous Puppeteer interactions there’s no table data available to read yet.

I need to know: Can I extract this table data as JSON format while Puppeteer is still running? Or is there a method to execute tabletojson asynchronously during the scraping process?

const chrome = require('puppeteer')
const tableConverter = require('tabletojson')

async function scrapeData() {
    // puppeteer setup
    const browserInstance = await chrome.launch({
        headless: false,
        defaultViewport: null
    })

    const newPage = await browserInstance.newPage()
    const targetUrl = "https://example-site.com/..."
    await newPage.goto(targetUrl)

    // ... enter search terms and click the search button here ...

    // table conversion logic
    // goal is to extract JSON from the table data
    // currently returns an empty array because of the timing issue:
    // convertUrl() makes its own request and never sees the table
    await tableConverter.convertUrl(
        targetUrl,
        { stripHtmlFromCells: true },
        function(jsonTables) {
            console.log(jsonTables)
        }
    )
}

scrapeData()

The HTML structure looks like this:

<form method="post" action="/SearchResults/pages/main.xhtml" enctype="application/x-www-form-urlencoded">

<!-- Input fields here -->

<button id="searchForm:mainTab:submitBtn" name="searchForm:mainTab:submitBtn" class="ui-button ui-widget ui-state-default ui-corner-all ui-button-text-only" onclick="" type="submit" role="button" aria-disabled="false">
<span class="ui-button-text ui-c">Submit Query</span>
</button>

<!-- Table appears here after clicking Submit -->

</form>

Any help would be appreciated!

You’re on the right track but mixing two different approaches. Instead of using tabletojson separately, you should extract the table data directly within your Puppeteer context using page.evaluate(). This way you can access the DOM after all your interactions are complete.

Here’s what worked for me in a similar situation:

const chrome = require('puppeteer')

async function scrapeData() {
    const browserInstance = await chrome.launch({
        headless: false,
        defaultViewport: null
    })
    
    const newPage = await browserInstance.newPage()
    await newPage.goto("https://example-site.com/...")
    
    // Your search interactions here
    await newPage.click('#searchForm\\:mainTab\\:submitBtn')
    await newPage.waitForSelector('table') // wait for table to appear
    
    // Extract table data as JSON
    const tableData = await newPage.evaluate(() => {
        const table = document.querySelector('table')
        const rows = Array.from(table.querySelectorAll('tr'))
        
        return rows.map(row => {
            const cells = Array.from(row.querySelectorAll('td, th'))
            return cells.map(cell => cell.textContent.trim())
        })
    })
    
    console.log(tableData)
    await browserInstance.close()
}

This approach keeps everything within the same browser context where your table actually exists.
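If you need keyed JSON objects rather than arrays of cell values, you can map the header row onto each data row inside the same evaluate() call. Here’s a minimal sketch, assuming the first row of the table holds the column headers:

const tableData = await newPage.evaluate(() => {
    const table = document.querySelector('table')
    const [headerRow, ...dataRows] = Array.from(table.querySelectorAll('tr'))

    // read the column names from the header row
    const headers = Array.from(headerRow.querySelectorAll('th, td'))
        .map(cell => cell.textContent.trim())

    // build one object per data row, keyed by column name
    return dataRows.map(row => {
        const cells = Array.from(row.querySelectorAll('td'))
        return Object.fromEntries(
            cells.map((cell, i) => [headers[i], cell.textContent.trim()])
        )
    })
})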

The fundamental problem you’re encountering is that tabletojson operates independently and cannot access the dynamically generated content from your Puppeteer session. When you perform interactions like form submissions, the resulting table exists only in your active browser instance.

I faced a similar challenge when scraping search results that required multiple form interactions. The solution involves capturing the HTML content directly from your Puppeteer page and then processing it with tabletojson. Here’s how you can modify your approach:

const chrome = require('puppeteer')
const tableConverter = require('tabletojson')

async function scrapeData() {
    const browserInstance = await chrome.launch({
        headless: false,
        defaultViewport: null
    })
    
    const newPage = await browserInstance.newPage()
    await newPage.goto("https://example-site.com/...")
    
    // Perform your search interactions
    await newPage.click('#searchForm\\:mainTab\\:submitBtn')
    await newPage.waitForSelector('table')
    
    // Get the HTML content after interactions
    const htmlContent = await newPage.content()
    
    // Use tabletojson with the HTML string instead of URL
    const jsonTables = tableConverter.convert(htmlContent, {
        stripHtmlFromCells: false
    })
    
    console.log(jsonTables)
    await browserInstance.close()
}

This approach lets you keep using tabletojson while preserving all the dynamic content generated through your Puppeteer interactions. The key is using the convert() method with an HTML string rather than convertUrl().
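One thing to watch: convert() returns an array with one entry per <table> in the HTML, and each entry is an array of row objects keyed by the header cells. If the page contains several tables you’ll need to pick out the right one. A small sketch, assuming the search results are the first table on the page:

const jsonTables = tableConverter.convert(htmlContent, {
    stripHtmlFromCells: false
})

// assumption: the search results are the first table on the page
const resultsTable = jsonTables[0]
console.log(JSON.stringify(resultsTable, null, 2))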

yeah, Puppeteer’s evaluate method is definitely the way to go here. tabletojson won’t work because it makes a fresh request without your search data. Try using page.$$eval('table tr', rows => ...) to grab all the table rows at once after your form submission completes.
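For example, something like this (rough sketch, assuming a plain table):

const rows = await newPage.$$eval('table tr', trs =>
    trs.map(tr =>
        // collect the trimmed text of every cell in the row
        Array.from(tr.querySelectorAll('td, th')).map(cell => cell.textContent.trim())
    )
)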