How to Parse XML in Python

TL;DR

    from xml.dom import minidom

    parsedXML = minidom.parseString("[your xml here]")

    elements_by_tag = parsedXML.getElementsByTagName('[tagname here]')  # get list of elms with tagname
    print(elements_by_tag[0].firstChild.nodeValue)  # print inner text value of an element

Introduction

I recently spent a few hours refactoring some of the backend code on this site at xtrp.io. In changing the backend, I wanted to make sure the refactored code worked the same way as my old code.

To do this, I wrote a unit test in Python that sent a request to every URL on my site (running the old backend code) and to the corresponding URL on my local server (running the new code), to make sure the two behaved exactly the same.

As with many other sites, xtrp.io has a sitemap at xtrp.io/sitemap/ which lists all of the URLs on the site with their titles and other information for search engines and other bots. Since this sitemap is programmed to output all of the URLs on my site (excluding error pages like the 404 page), this seemed like a perfect way to get a list of URLs on my site to test the new backend code.

Sitemaps are written in XML, a markup language that closely resembles HTML and uses the same tag- and element-based syntax.

So, to go about getting all the URLs from my sitemap, I had to get the raw XML from xtrp.io/sitemap/, and then parse that XML to get items of a particular tag name (loc in this case).
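The two steps described above (download the raw XML, then pull out the loc elements) might be sketched as follows. This is a minimal sketch, assuming the standard library's urllib.request is used for the download; the post does not say how the request was actually made, and the function name fetch_loc_values and the 10-second timeout are illustrative choices.

```python
from urllib.request import urlopen
from xml.dom import minidom


def fetch_loc_values(sitemap_url, timeout=10):
    """Download a sitemap and return the text of every <loc> element.

    Hypothetical helper for illustration; not part of the original code.
    """
    # urlopen returns a response object whose read() yields the raw bytes
    with urlopen(sitemap_url, timeout=timeout) as response:
        raw_xml = response.read()

    # parseString accepts bytes as well as str
    document = minidom.parseString(raw_xml)

    return [
        element.firstChild.nodeValue
        for element in document.getElementsByTagName("loc")
        if element.firstChild is not None  # skip empty <loc/> elements
    ]


# Example call (performs a real network request, so it is left commented out):
# print(fetch_loc_values("https://xtrp.io/sitemap/"))
```

The sections below walk through the parsing half of this helper one step at a time.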

1. Parse XML with xml.dom.minidom

To parse the XML, we'll use the xml.dom.minidom module, which ships with Python's standard library.

The minidom module has a handy parseString function to parse XML from a Python string. It returns a Document object with methods for getting information from the parsed XML and selecting various elements.

    from xml.dom import minidom

    parsedXML = minidom.parseString("""
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
            <loc>https://xtrp.io/</loc>
            <changefreq>weekly</changefreq>
            <priority>0.9</priority>
        </url>
        <url>
            <loc>https://xtrp.io/about/</loc>
            <changefreq>monthly</changefreq>
            <priority>0.6</priority>
        </url>
    </urlset>
    """)

Now we can call various methods on the parsedXML object to get information from elements in the XML.
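As a quick illustration of what the parsed object offers, the sketch below inspects the root element of a small sitemap. The variable names and the shortened sample XML are chosen here for illustration; documentElement, tagName, and getAttribute are standard DOM members that minidom exposes.

```python
from xml.dom import minidom

# Shortened sample sitemap, used only for this illustration
sample_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://xtrp.io/</loc>
  </url>
</urlset>"""

document = minidom.parseString(sample_xml)

# documentElement is the root element of the document (<urlset> here)
root = document.documentElement
root_tag = root.tagName
namespace = root.getAttribute("xmlns")

print(root_tag)   # urlset
print(namespace)  # http://www.sitemaps.org/schemas/sitemap/0.9
```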

2. Get Elements by Tag Name

Getting elements by tag name from parsed XML in xml.dom.minidom is pretty simple: use the getElementsByTagName method, which returns a list-like collection of every element in the document with that tag name.

For example, in my case I want a list of elements with the tag name of loc:

    elements = parsedXML.getElementsByTagName('loc')  # any tag name in place of 'loc' is valid
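getElementsByTagName can also be called on an individual element, in which case it only searches that element's descendants. The sketch below uses this to read the loc and priority of each url entry; the sample XML and variable names are illustrative assumptions rather than code from the site.

```python
from xml.dom import minidom

sample_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://xtrp.io/</loc>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>https://xtrp.io/about/</loc>
    <priority>0.6</priority>
  </url>
</urlset>"""

document = minidom.parseString(sample_xml)

url_elements = document.getElementsByTagName("url")
print(len(url_elements))  # 2

entries = []
for url_element in url_elements:
    # Calling the method on an element limits the search to its descendants
    loc = url_element.getElementsByTagName("loc")[0]
    priority = url_element.getElementsByTagName("priority")[0]
    entries.append((loc.firstChild.nodeValue, float(priority.firstChild.nodeValue)))

print(entries)
# [('https://xtrp.io/', 0.9), ('https://xtrp.io/about/', 0.6)]
```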

3. Get the Inner Text of a Particular Element

Now that I had a list of loc elements, I needed to get the text of each of those elements and collect it in a separate list.

To get text from an element in xml.dom.minidom, use the firstChild.nodeValue property of the element, like this: element.firstChild.nodeValue.

In my case, here's how I looped through each element and added its text content to a list:

    elements = parsedXML.getElementsByTagName('loc')

    locations = []

    for element in elements:  # loop through 'loc' elements
        locations.append(element.firstChild.nodeValue)  # get inner text of each element and add to list

    print(locations)  # ['https://xtrp.io/', 'https://xtrp.io/about/']

In getting the text of an element, the firstChild property gets the text node from the element, and the nodeValue property gets the value of that particular text node.
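One caveat: firstChild is None for an empty element, and an element's text can be split across several child nodes (for example, when it contains a CDATA section). A more defensive helper could look like the sketch below; get_text is a hypothetical name used here for illustration, not a minidom function, and the sample XML is made up to show both edge cases.

```python
from xml.dom import Node, minidom


def get_text(element):
    """Return the concatenated text of an element's direct text children."""
    parts = []
    for child in element.childNodes:
        # Plain text and CDATA sections are separate node types in the DOM
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(child.data)
    return "".join(parts)


document = minidom.parseString(
    "<urlset><loc>https://xtrp.io/</loc><loc></loc>"
    "<loc>a<![CDATA[&b]]>c</loc></urlset>"
)

texts = [get_text(element) for element in document.getElementsByTagName("loc")]
print(texts)  # ['https://xtrp.io/', '', 'a&bc']
```

For a sitemap like the one above, where every loc holds a single run of text, element.firstChild.nodeValue gives the same result with less code.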

Conclusion

You can see the final source code for this article in my tutorials repo.

I hope you enjoyed this post and found it useful for parsing XML from a URL in Python. I spent some time reading documentation to get this working myself, so I thought I'd write this post to help out anyone in the same position.

Thanks for scrolling.

— Gabriel Romualdo, August 8, 2020