Understanding parent node relationships with XML::LibXML text nodes

I’m running into a confusing issue with XML::LibXML when working with text nodes and their parent relationships.

When I use an XPath expression like //product/id/text() to select text nodes, I notice something strange about the parent node hierarchy. To reach what I consider the actual parent element, I need to call parentNode twice instead of once.

Here’s a code example that shows the problem:

use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;
my $doc = $parser->parse_string(<<'XML');
<store>
  <product>
    <name>Programming Perl</name>
    <author>Larry Wall</author>
    <id>978-0596004927</id>
    <price>45.99</price>
    <cover url="http://example.com/covers/perl.jpg" width="120" height="160" />
  </product>
  <product>
    <name>Learning Python</name>
    <author>Mark Lutz</author>
    <id>978-1449355739</id>
    <price>39.99</price>
    <cover url="http://example.com/covers/python.jpg" width="120" height="160" />
  </product>
</store>
XML

foreach my $id_text ($doc->findnodes('//product/id/text()')) {
    # Need double parentNode call to get <product> element
    my $product_node = $id_text->parentNode->parentNode;
    print $product_node->toString, "\n";
}

Is this the expected behavior? Are //id and //id/text() actually different nodes in the DOM tree? I’m confused about whether this is normal or if there’s something wrong with how I’m understanding the node structure.

yeah, this confused me when i started using XML::LibXML. the xpath text() grabs the actual text node, not the string value. so you end up with product → id → text node as three separate DOM nodes (two elements and one text node). that’s why you need the double parentNode - you’re climbing back up two levels in the tree.

Hit this same problem building XML processing pipelines for data ingestion. The double parentNode thing confused me until I figured out what’s happening.

When you use text() in XPath, you’re not getting a string - you’re getting a DOM node that represents the text content. Your hierarchy looks like:

<product> → <id> → text node containing “978-0596004927”

That text node is a real DOM object, so you need two jumps to get back to product.
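
Here’s a minimal sketch of those two jumps, in case it helps to see them one at a time. The trimmed-down XML string and the variable names are made up for illustration; the :libxml import tag is just there to pull in the node-type constants.

```perl
use strict;
use warnings;
use XML::LibXML qw(:libxml);    # :libxml exports XML_TEXT_NODE and friends

# Cut-down version of the document from the question (illustrative only)
my $doc = XML::LibXML->load_xml(
    string => '<store><product><id>978-0596004927</id></product></store>',
);

# text() selects the text node itself, not a plain Perl string
my ($text_node) = $doc->findnodes('//product/id/text()');

my $id_elem      = $text_node->parentNode;    # jump 1: text node -> <id>
my $product_elem = $id_elem->parentNode;      # jump 2: <id> -> <product>

print "start is a text node\n" if $text_node->nodeType == XML_TEXT_NODE;
print "first parent:  <",  $id_elem->nodeName,      ">\n";
print "second parent: <", $product_elem->nodeName, ">\n";
```

Each parentNode call moves up exactly one level, so the number of calls simply mirrors how deep the selected node sits below the element you want.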

After dealing with enough XML parsing headaches, I started automating our XML workflows. Instead of writing custom Perl scripts that handle DOM traversal logic, I set up automated processes that parse XML files, extract data, and transform it into whatever format we need.

You don’t worry about node types or parent relationships anymore. Just define what data you want extracted and where it goes. Way cleaner than maintaining custom parsing code.

For serious XML processing work, automation saves tons of debugging time.

Yeah, totally expected. You’re working with three different nodes here: two element nodes and one text node.

When you do //product/id/text(), you’re grabbing the actual text content node. Here’s how the DOM looks:

  • <product> element
    • <id> element
      • text node with “978-0596004927”

So text() gets you that bottom text node. Its parent is the <id> element, then the <product> element above that. That’s why you need two parentNode calls.

If you just used //product/id, you’d get the <id> element directly and only need one parentNode call.
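
A rough sketch of that variant, assuming a shortened copy of the store document (the sample data and variable names are only for illustration):

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened two-product document, modeled on the one in the question
my $doc = XML::LibXML->load_xml(string => <<'XML');
<store>
  <product><name>Programming Perl</name><id>978-0596004927</id></product>
  <product><name>Learning Python</name><id>978-1449355739</id></product>
</store>
XML

my @parent_names;
foreach my $id_elem ($doc->findnodes('//product/id')) {
    # Starting from the <id> element, a single parentNode reaches <product>
    my $product_node = $id_elem->parentNode;
    push @parent_names, $product_node->nodeName;
    print $product_node->toString, "\n";
}
```

The only change from the original loop is the XPath expression; dropping /text() moves the starting point up one level.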

I hit similar XML parsing headaches in production. Instead of fighting DOM traversal logic, I automated everything with Latenode. You can build workflows that parse XML, extract what you need, and transform it however you want without writing custom traversal code.

For complex XML stuff, automation beats manual parsing every time. Way less buggy and easier to maintain.

Yes, this can be confusing at first, but it’s typical behavior across DOM parsers, not just Perl’s XML::LibXML. In the DOM, text content is not stored inside the element node itself; it is represented as a separate text node that is a child of the element. For instance, with an <id> element containing 978-0596004927, the text is a child node of the <id> element. So when your XPath expression //product/id/text() selects the text node, you start at that leaf node: the first parentNode call returns the <id> element, and the second returns <product>.
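
To check the "are //id and //id/text() different nodes?" part directly, something like the following sketch should work. It relies on isSameNode and firstChild from XML::LibXML; the one-product XML string and the variable names are assumptions for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Single-product document, just enough to compare the two selections
my $doc = XML::LibXML->load_xml(
    string => '<store><product><id>978-0596004927</id></product></store>',
);

my ($id_elem) = $doc->findnodes('//product/id');         # the element node
my ($id_text) = $doc->findnodes('//product/id/text()');  # its text child

# The element and the text node are two distinct nodes in the tree
my $distinct = $id_elem->isSameNode($id_text) ? 0 : 1;

# The text node is the element's first child, and the element is its parent
my $is_first_child = $id_elem->firstChild->isSameNode($id_text) ? 1 : 0;
my $parent_is_id   = $id_text->parentNode->isSameNode($id_elem) ? 1 : 0;

print "distinct nodes:      $distinct\n";
print "text is first child: $is_first_child\n";
print "parent of text is id: $parent_is_id\n";
```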

This is completely normal - it’s just how XML DOM structures work. The confusion happens because people think text() in XPath gives you the string value, but it actually returns a text node object.

In XML::LibXML, every piece of text content is its own node in the tree. So when you’ve got <id>978-0596004927</id>, the DOM creates an element node for <id> and a separate text node with the string content. The text node is a child of the element node.

Your XPath //product/id/text() goes straight to that text node. Since the text node’s parent is the <id> element, you need the first parentNode call to reach <id>, then a second one to reach <product>.

Want to avoid the double parentNode calls? Change your XPath to //product/id to select the <id> elements directly, then use textContent to extract the string values. That gives you the element node as your starting point instead of the text node.
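
A short sketch of that approach, assuming a reduced version of the store document. The %name_for hash and the lookup of the sibling <name> via findvalue are illustrative additions, not something from the original code.

```perl
use strict;
use warnings;
use XML::LibXML;

# Reduced copy of the document from the question
my $doc = XML::LibXML->load_xml(string => <<'XML');
<store>
  <product><name>Programming Perl</name><id>978-0596004927</id></product>
  <product><name>Learning Python</name><id>978-1449355739</id></product>
</store>
XML

my %name_for;    # hypothetical lookup table: id string => product name
foreach my $id_elem ($doc->findnodes('//product/id')) {
    my $id      = $id_elem->textContent;    # string value of the <id> element
    my $product = $id_elem->parentNode;     # one step up is <product>

    # findvalue evaluates the XPath relative to <product> and returns a string
    $name_for{$id} = $product->findvalue('name');
    print "$id: $name_for{$id}\n";
}
```

If the <product> element is all you need, selecting it directly with an expression such as //product[id] would skip the upward traversal entirely.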