Parsing XML and HTML using Perl and LibXML
I used to parse HTML data with regular expressions, and XML documents with XML parsers that normally turned a document into arrays and hashes (key-value pairs).
But this time, I needed to retrieve only specific nodes of an XML document, those with a specific attribute. I could have retrieved all nodes and then looped over them with some conditions to keep only the ones I was interested in, or I could use a more clever, modern solution. And this is where LibXML stepped in.
XML::LibXML is probably the most widely used XML parsing library for Perl and is based on libxml2. What is most interesting about this library, apart from its fast parsing, is that it supports XPath syntax. I had seen the term before but had never actually used it in practice. XPath lets you specify which elements you want to retrieve very easily.
Look at this example:
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use XML::LibXML;

    my $parser = XML::LibXML->new();
    # recover => 2 keeps parsing broken HTML and suppresses the parser warnings
    my $doc = $parser->load_html(location => "http://seznam.cz", recover => 2);
    # Print every div whose class attribute is exactly "text-box"
    print $doc->findnodes('//div[@class="text-box"]');

This code loads the HTML from seznam.cz and retrieves all DIV elements whose class attribute is exactly “text-box”. You can later go through all returned nodes and get some parts of their bodies. It’s magical.
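Going through the returned nodes can look roughly like the sketch below. It parses an inline string instead of a live page so that it runs offline; the markup and the helper name text_boxes are made up for illustration and are not part of the original script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical markup standing in for a downloaded page.
my $html = <<'HTML';
<html><body>
  <div class="text-box"><a href="/news">News</a> of the day</div>
  <div class="other">ignored</div>
  <div class="text-box"><a href="/mail">Mail</a> login</div>
</body></html>
HTML

# Parse from a string; recover => 2 tolerates broken markup quietly.
my $doc = XML::LibXML->load_html(string => $html, recover => 2);

# Collect the link target and the text of every matching div.
sub text_boxes {
    my ($dom) = @_;
    my @boxes;
    for my $div ($dom->findnodes('//div[@class="text-box"]')) {
        my ($link) = $div->findnodes('./a');    # first <a> child, if any
        push @boxes, {
            href => $link ? $link->getAttribute('href') : undef,
            text => $div->textContent,          # all text inside the div
        };
    }
    return @boxes;
}

printf "%s => %s\n", $_->{href} // '-', $_->{text} for text_boxes($doc);
```

Each node returned by findnodes is an ordinary element object, so you can call findnodes on it again with a relative path, or read attributes and text from it directly.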
If you plan to process an XML file that uses a namespace, make sure to register the namespace and use the registered prefix in all element names, like this:

    my $parser = XML::LibXML->new();
    my $doc = $parser->parse_file("metadata.xml");

    # Bind the prefix "i" to the namespace URI used by the document
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs("i", "http://apple.com/itunes/importer");

    foreach my $iap ($xpc->findnodes('/i:package/i:software/i:software_metadata/i:in_app_purchases/i:in_app_purchase')) {
        # The second argument makes the query relative to the current node
        my($node) = $xpc->findnodes('./i:locales/i:locale[@name="en-US"]/i:title', $iap);
        ...
    }
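To see why the registration matters, here is a small self-contained sketch. The inline document is a made-up, heavily reduced stand-in that only borrows the namespace URI from the example above; it is not the real metadata.xml layout.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document with a default namespace (no prefix in the file itself).
my $xml = <<'XML';
<package xmlns="http://apple.com/itunes/importer">
  <software><title>Demo</title></software>
</package>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Unprefixed names in XPath mean "no namespace", so nothing matches here.
my @plain = $doc->findnodes('/package/software/title');

# With a registered prefix, the same path finds the element.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(i => 'http://apple.com/itunes/importer');
my @prefixed = $xpc->findnodes('/i:package/i:software/i:title');

printf "without prefix: %d, with prefix: %d\n", scalar @plain, scalar @prefixed;
```

The prefix is your own choice. It does not have to match anything in the file, only the namespace URI has to be identical.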
Here is another example of parsing other div nodes and their subnodes:
    my $doc = $parser->parse_html_string($content);
    my $xpc = XML::LibXML::XPathContext->new($doc);

    foreach my $cont ($xpc->findnodes('//div[contains(@class,"story ")]')) {
        # get_node() counts from 1, like XPath positions, so 1 is the first match
        my $msg  = $xpc->findnodes('./div[contains(@class,"msg")]', $cont)->get_node(1);
        my $user = $xpc->findnodes('./a/strong', $msg)->get_node(1);
        my $text = $xpc->findnodes('./span', $msg)->get_node(1);
        # ... print variables ...
    }

As you can see here, an XPath query can even use built-in functions such as contains(where, what), which is true when the string where contains the substring what. The full list of built-in functions is part of the XPath specification.
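A few of these functions side by side, as a rough sketch on made-up markup (the class names and texts are invented for the demonstration). It also shows one thing to watch for: contains() is a plain substring test, so it does not know where one class name ends and the next begins.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical markup; only the class names matter here.
my $html = <<'HTML';
<html><body>
  <div class="story first"><span>  Hello   world </span></div>
  <div class="story"><span>Second</span></div>
  <div class="history"><span>Not a story</span></div>
</body></html>
HTML

my $doc = XML::LibXML->load_html(string => $html, recover => 2);

# contains(): substring test, so "history" matches "story" as well.
my @loose = $doc->findnodes('//div[contains(@class, "story")]');

# Padding both sides with spaces matches whole class names only.
my @exact = $doc->findnodes(
    '//div[contains(concat(" ", normalize-space(@class), " "), " story ")]'
);

# starts-with(): the attribute value must begin with the given string.
my @starts = $doc->findnodes('//div[starts-with(@class, "story")]');

# findvalue() returns the result of an expression as a string.
my $greeting = $doc->findvalue('normalize-space(//div[1]/span)');
my $divs     = $doc->findvalue('count(//div)');

printf "loose=%d exact=%d starts=%d divs=%s\n",
    scalar @loose, scalar @exact, scalar @starts, $divs;
print "$greeting\n";
```

The trailing space in "story " from the example above is a lighter version of the same idea: it only matches when another class name follows, so a div whose class is just "story" would be skipped.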
It’s always time-consuming, and for some people unpleasant, to learn new things or new ways of doing something, but I find it interesting. In the end, it may help you do difficult tasks more quickly and easily. I hope this article has given you some new inspiration.
You can use LibXML and XPath in many other scripting languages, and it works more or less the same way. But after a few months of working with Perl and Ruby, I can highly recommend Ruby over Perl. Perl is faster, yes, but Ruby is simply more comfortable and less error-prone. Starting with Ruby from the beginning can save you a lot of heartache. Here is LibXML for Ruby. If you are a complete beginner, look here.