Understanding parent node behavior with XML::LibXML text nodes

I’m running into something confusing with XML::LibXML when working with text nodes and their parents.

When I use XPath to select text nodes with //product/sku/text(), I need to call parentNode twice to reach what I expect to be the direct parent. Here’s what I mean:

use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();
my $doc = $parser->parse_string(<<'XML');
<catalog>
  <product>
    <name>Learning Python</name>
    <writer>Mark Lutz</writer>
    <sku>1449355730</sku>
    <stock>250</stock>
  </product>
  <product>
    <name>JavaScript Guide</name>
    <writer>David Flanagan</writer>
    <sku>1491952024</sku>
    <stock>180</stock>
  </product>
</catalog>
XML

for my $sku_text ($doc->findnodes('//product/sku/text()')) {
    # Why do I need two parentNode calls here?
    my $product_elem = $sku_text->parentNode->parentNode;
    print $product_elem->toString . "\n";
}

This seems weird to me. Are //sku and //sku/text() different nodes? Or is there an issue with how parentNode works in XML::LibXML?

That’s actually how the DOM is supposed to work. The text “1449355730” inside your sku element is its own text node in the document tree. When you use //product/sku/text(), you’re grabbing that text node directly. Since the text node’s parent is the sku element, and sku’s parent is product, you need two parentNode calls to get back to product.

I ran into this same thing scraping product feeds a few years back. The DOM spec models text as separate nodes, and XML::LibXML follows that rule. You can check $sku_text->nodeType if you want to see: it’ll return 3, which XML::LibXML exports as the XML_TEXT_NODE constant.
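A minimal sketch of that check, assuming a one-product document trimmed down from the question (the variable names are only illustrative). It walks up from the text node and looks at what each parentNode call returns:

```perl
use strict;
use warnings;
use XML::LibXML;    # exports node-type constants such as XML_TEXT_NODE by default

my $doc = XML::LibXML->new->parse_string(
    '<catalog><product><sku>1449355730</sku></product></catalog>'
);

# In list context findnodes returns a list; take the first match.
my ($sku_text) = $doc->findnodes('//product/sku/text()');

my $sku     = $sku_text->parentNode;    # first hop: text node -> <sku>
my $product = $sku->parentNode;         # second hop: <sku> -> <product>

print "is text node: ", ($sku_text->nodeType == XML_TEXT_NODE ? "yes" : "no"), "\n";
print "parent:       ", $sku->nodeName, "\n";
print "grandparent:  ", $product->nodeName, "\n";
```

The first parentNode call lands on the sku element and the second on product, which matches the two hops in the original loop.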

Honestly, it’s usually easier to just use //product/sku and call textContent instead of messing with text nodes directly.
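Something like the following sketch shows that approach. It assumes a shortened version of the catalog from the question, and the @rows array is just a hypothetical place to collect output:

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_string(<<'XML');
<catalog>
  <product><name>Learning Python</name><sku>1449355730</sku></product>
  <product><name>JavaScript Guide</name><sku>1491952024</sku></product>
</catalog>
XML

my @rows;
for my $sku ($doc->findnodes('//product/sku')) {
    my $product = $sku->parentNode;               # one hop: <sku> -> <product>
    my $name    = $product->findvalue('name');    # string value of the <name> child
    push @rows, $name . ': ' . $sku->textContent; # textContent returns the text inside <sku>
}

print "$_\n" for @rows;
```

Selecting the element keeps the text one method call away and puts product a single parentNode call up.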

Yeah, //sku and //sku/text() are totally different nodes. You’re selecting the text node itself, not the element.

With //product/sku/text(), you’re grabbing the actual text content (like “1449355730”) as a separate node. The sku element is its parent, product is the grandparent. That’s why you need two parentNode calls.

If you used //product/sku, you’d get the sku element directly and only need one parentNode call to reach product.
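To see the relationship between the two selections, here is a rough sketch that runs both XPath expressions against the same (assumed, single-product) document and compares the results with isSameNode:

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new->parse_string(
    '<catalog><product><sku>1449355730</sku></product></catalog>'
);

my ($sku_elem) = $doc->findnodes('//product/sku');          # the element
my ($sku_text) = $doc->findnodes('//product/sku/text()');   # the text node inside it

# The text node's parent is the very element the first query returned.
my $text_parent_is_sku = $sku_text->parentNode->isSameNode($sku_elem) ? 1 : 0;

# One hop from the element and two hops from the text node reach the same <product>.
my $same_product =
    $sku_elem->parentNode->isSameNode($sku_text->parentNode->parentNode) ? 1 : 0;

print "text parent is sku element: $text_parent_is_sku\n";
print "both paths reach same product: $same_product\n";
```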

I’ve hit this exact issue parsing XML feeds for our product catalog. Instead of fighting XPath quirks and manual parsing, I built an automated workflow in Latenode that handles all the XML processing.

It pulls data from multiple XML sources, normalizes everything, and pushes clean data to our database. No more text nodes vs elements headaches or chaining parentNode calls. The automation runs hourly and processes thousands of products without any manual work.

Latenode makes XML processing way more straightforward than dealing with these library specifics: https://latenode.com

The DOM represents text content as actual nodes in the document tree. When you use //product/sku/text(), you’re selecting the text node containing “1449355730”, not the sku element itself.

Here’s the structure:
product element → sku element → text node (“1449355730”)

You need two parentNode calls because the text node’s parent is sku, and sku’s parent is product.

I dealt with this exact issue processing XML feeds from suppliers. XPath navigation and node traversal get messy fast with multiple feed formats.

Now I just use Latenode workflows for XML processing. It parses any XML structure, extracts what I need, and transforms it into clean data automatically. No more chaining parentNode calls or worrying about text nodes vs elements.

The automation handles thousands of product records daily and saves me hours of manual parsing. Way cleaner than writing custom Perl scripts for every XML format: https://latenode.com

The confusion happens because XML::LibXML treats text content as actual nodes in the document tree, not just string values. When you select //product/sku/text(), you’re grabbing the text node containing “1449355730”, not the sku element itself. In XML::LibXML’s structure, the sku element has a text node as its child, so your hierarchy is: product → sku → text node. That’s why the first parentNode takes you from the text node to sku, and a second parentNode gets you to product. I’ve run into this tons of times in data extraction scripts. If you want the element instead of its text node, just drop /text() from your XPath. But if you actually need to work with text nodes, the double parentNode call is correct: that is how the DOM tree is laid out, and XML::LibXML follows the DOM spec.

this tripped me up when i started with libxml too. the xpath //sku/text() grabs the actual text node inside the sku element - so parentNode takes you to sku, then another parentNode gets you to product. just use //sku if you want the sku element itself and drop the /text() part.

yep, i got confused about this too when i first used libxml. the text in an element is its own node, so when you do //sku/text(), it grabs that text node. that’s why you need parentNode twice to go back up the tree: once to reach sku, and once more to reach product.