Is it possible to extract JavaScript variables from HTML using DOMXPath?

Hey everyone, I’m trying to figure out if I can use DOMXPath to grab JavaScript variables that are embedded in an HTML document. Here’s what I’m dealing with:

<body>
  <div></div>
  <script>
    var myInfo = {
      "users": [
        {"name": "Alice", "role": "Developer"},
        {"name": "Bob", "role": "Designer"},
        {"name": "Charlie", "role": "Manager"}
      ]
    };
  </script>
</body>

I know DOMXPath can access elements inside the body tag, but I’m not sure how to handle JavaScript variables. They don’t seem to be nodes. Can DOMXPath handle this, or do I need to try something else like regex?

I’m scraping this HTML from another website, so I need to grab the data and save it locally. Any ideas on how to tackle this? Thanks!

yo, domxpath ain’t gonna cut it for js variables by itself. it only sees the <script> element, not what’s inside the code. what i’d do is grab the script content (in a browser that’s document.querySelector('script').textContent; with DOMXPath the text of the //script node gets u the same string), then use a regex like /var\s+myInfo\s*=\s*({.*?});/s to snag that json. parse it n ur good 2 go. just watch out for security stuff n make sure the site’s cool w/ u scrapin their data
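A rough sketch of that regex step, assuming you already have the script text as a string. The helper name extractMyInfo is made up for illustration, and the braces in the pattern are escaped just to be explicit. It only works because the object in the question happens to be valid JSON.

```javascript
// Hypothetical helper: pulls the object assigned to `var myInfo` out of raw script text.
function extractMyInfo(scriptText) {
  // The `s` flag lets `.` match newlines, so the lazy `.*?` can span the whole literal.
  const match = scriptText.match(/var\s+myInfo\s*=\s*(\{.*?\});/s);
  if (!match) return null; // variable not present in this script
  // Assumption: the literal is valid JSON (double-quoted keys and strings).
  return JSON.parse(match[1]);
}

// Script text copied from the question.
const scriptText = `
    var myInfo = {
      "users": [
        {"name": "Alice", "role": "Developer"},
        {"name": "Bob", "role": "Designer"},
        {"name": "Charlie", "role": "Manager"}
      ]
    };
`;

const info = extractMyInfo(scriptText);
console.log(info.users.map((u) => u.name).join(", ")); // Alice, Bob, Charlie
```

The lazy match stops at the first "};" it finds, so a string value containing "};" would cut the capture short.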

From my experience, DOMXPath isn’t designed to extract JavaScript variables directly; it navigates HTML and XML document structures. The most it can give you is the text content of the script element, and the variable is just a substring of that text rather than a DOM node. When I dealt with a similar challenge, I combined regular expressions with string manipulation: I used a regex to locate the JavaScript object, stripped the variable assignment, and parsed the cleaned string as JSON. This proved more robust than trying to execute the script. Include proper error handling, and watch for changes in the website’s markup or terms of service.
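One way the locate, strip, and parse steps could look, as a sketch rather than the exact code I used. Here a regex only finds the assignment, and a small brace counter (my own variant, not a library function) slices out the balanced object so that braces inside string values do not end the match early. The names sliceObjectLiteral and parseVar are hypothetical, and varName is assumed to be a plain identifier without regex metacharacters.

```javascript
// Hypothetical helper: find `var <name> = {` and return the balanced {...} text after it.
function sliceObjectLiteral(source, varName) {
  // Assumption: varName contains no regex metacharacters (e.g. no `$`).
  const start = source.search(new RegExp("var\\s+" + varName + "\\s*=\\s*\\{"));
  if (start === -1) return null;
  const open = source.indexOf("{", start);
  let depth = 0;
  let inString = false;
  for (let i = open; i < source.length; i++) {
    const ch = source[i];
    if (inString) {
      if (ch === "\\") i++; // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true; // only double-quoted strings are tracked (JSON style)
    } else if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
      if (depth === 0) return source.slice(open, i + 1);
    }
  }
  return null; // braces never balanced
}

// Hypothetical wrapper: returns the parsed object, or null on any failure.
function parseVar(source, varName) {
  const literal = sliceObjectLiteral(source, varName);
  if (literal === null) return null;
  try {
    return JSON.parse(literal);
  } catch (err) {
    return null; // the literal was not valid JSON
  }
}

const sample = 'var myInfo = {"note": "a }; b", "n": {"x": 1}}; var z = {};';
console.log(parseVar(sample, "myInfo").note); // a }; b
```

Returning null instead of throwing keeps the scraper running when the site changes its script, which is usually when this kind of code breaks.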

While DOMXPath is excellent for XML/HTML parsing, it’s not suited for extracting JavaScript variables on its own. In my experience, a more effective approach combines DOM methods with regular expressions. First, use document.getElementsByTagName('script') to collect all script elements. Then apply a regex such as /var\s+myInfo\s*=\s*({.*?});/s to each element’s text to extract the variable content. Once you have the raw string, parse it with JSON.parse(); this works only while the object literal is valid JSON (double-quoted keys and strings), as it is in your example. The method has held up across several projects for me, but wrap the parse in error handling and expect the pattern to break if the source website changes its structure.
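A minimal sketch of that flow for a page with several script tags, assuming the fetched HTML is available as a string. Since there is no document object outside a browser, a regex stands in for getElementsByTagName here, which is a simplification and not a real HTML parser. The function name extractVarFromHtml and the sample markup are made up for illustration.

```javascript
// Hypothetical helper: scans every <script> body in an HTML string for `var <name> = {...};`.
function extractVarFromHtml(html, varName) {
  // Regex stand-in for document.getElementsByTagName('script'), so this runs outside a browser.
  const scriptPattern = /<script\b[^>]*>(.*?)<\/script>/gis;
  // Escape the name so it is matched literally inside the pattern.
  const safeName = varName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const varPattern = new RegExp("var\\s+" + safeName + "\\s*=\\s*(\\{.*?\\});", "s");

  for (const [, body] of html.matchAll(scriptPattern)) {
    const match = body.match(varPattern);
    if (!match) continue; // this script does not define the variable
    try {
      return JSON.parse(match[1]);
    } catch (err) {
      continue; // not valid JSON (e.g. unquoted keys); try the next script
    }
  }
  return null; // nothing usable found
}

// Shortened version of the markup from the question, plus one unrelated script.
const html = `<body>
  <div></div>
  <script>var unrelated = 1;</script>
  <script>
    var myInfo = {"users": [{"name": "Alice", "role": "Developer"}]};
  </script>
</body>`;

console.log(extractVarFromHtml(html, "myInfo").users[0].name); // Alice
```

In a browser, the loop could run over the textContent of each element returned by getElementsByTagName('script') instead of the regex matches.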