I recently spent a few hours refactoring some of the backend code on this site at xtrp.io. In changing the backend, I wanted to make sure the refactored code worked the same way as my old code.
To do this, I wrote a unit test in Python that sent a request to every URL on my site, which was running the old backend code, and to the corresponding URL on my local server, which was running the new code, to make sure both behaved exactly the same.
As with many other sites, xtrp.io has a sitemap at xtrp.io/sitemap/ which lists all of the URLs on the site with their titles and other information for search engines and other bots. Since this sitemap is programmed to output all of the URLs on my site (excluding error pages like the 404 page), this seemed like a perfect way to get a list of URLs on my site to test the new backend code.
Sitemaps are written in XML, a markup language that looks much like HTML and uses the same tag- and element-based syntax.
So, to get all the URLs from my sitemap, I had to fetch the raw XML from xtrp.io/sitemap/ and then parse it to get the elements with a particular tag name (loc in this case).
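As a rough sketch of the fetching step, the raw XML can be downloaded with the standard library. The helper name fetch_xml, the timeout value, the https:// scheme and the UTF-8 decoding are all assumptions made for this illustration, and a third-party HTTP client would work just as well.

```python
import urllib.request


def fetch_xml(url, timeout=10):
    """Download a URL and return its body as a string (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Assumption: the sitemap is served as UTF-8, which is typical.
        return response.read().decode("utf-8")


# Example call (makes a real network request, so it is left commented out):
# xml = fetch_xml("https://xtrp.io/sitemap/")
```

The returned string is what gets handed to the XML parser in the next step.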
1. Parse XML with XML.DOM Minidom
To parse the XML, we'll use Python's built-in xml.dom.minidom module.
The minidom module has a handy parseString function that parses XML from a Python string. It returns a Document object with methods for getting information from the parsed XML and for selecting elements.
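Here is a minimal sketch of that parsing step. The XML string is a small, made-up stand-in for a real sitemap, and the example.com URLs are placeholders.

```python
from xml.dom import minidom

# A small stand-in for the sitemap XML (hypothetical content).
xml = """<urlset>
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

# parseString takes the XML as a string and returns a Document object.
parsedXML = minidom.parseString(xml)

# documentElement is the root element of the parsed document.
print(parsedXML.documentElement.tagName)  # prints "urlset"
```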
With the result stored in a variable named parsedXML, we can now call various methods on it to get information from elements in the XML.
2. Get Elements by Tag Name
Getting elements by tag name from parsed XML in xml.dom.minidom is pretty simple: use the getElementsByTagName method, which takes a tag name and returns all matching elements.
For example, in my case I wanted a list of elements with the tag name of loc.
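A short sketch of selecting elements by tag name might look like the following. The inline XML is placeholder data rather than my actual sitemap.

```python
from xml.dom import minidom

# Placeholder XML with two loc elements.
parsedXML = minidom.parseString(
    "<urlset>"
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/about/</loc></url>"
    "</urlset>"
)

# Collect every element named "loc", at any depth in the document.
elements = parsedXML.getElementsByTagName("loc")
print(len(elements))  # prints 2
```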
3. Get the Inner Text of a Particular Element
Now that I had a list of loc elements, I needed to get the text of each element and add it to its own list.
To get the text from an element in xml.dom.minidom, use the element's firstChild.nodeValue property.
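For a single element, reading the text could be sketched like this, again with a placeholder URL in the XML:

```python
from xml.dom import minidom

parsedXML = minidom.parseString("<url><loc>https://example.com/</loc></url>")

# Take the first (and only) loc element from the document.
element = parsedXML.getElementsByTagName("loc")[0]

# firstChild is the text node inside <loc>; nodeValue is its string content.
text = element.firstChild.nodeValue
print(text)  # prints "https://example.com/"
```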
In my case, I looped through each element and added its text content to a list.
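A minimal version of that loop, assuming placeholder XML and a list named urls (the name is chosen for this sketch), could be written as:

```python
from xml.dom import minidom

# Placeholder XML standing in for the real sitemap.
parsedXML = minidom.parseString(
    "<urlset>"
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/about/</loc></url>"
    "</urlset>"
)

urls = []
for element in parsedXML.getElementsByTagName("loc"):
    # Each loc element holds a single text node containing the URL.
    urls.append(element.firstChild.nodeValue)

print(urls)  # prints ['https://example.com/', 'https://example.com/about/']
```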
In getting the text of an element, the
firstChild property gets the text node from the element, and the
nodeValue property gets the value of that particular text node.
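One caveat worth sketching: an empty element has no children, so its firstChild is None and reading nodeValue from it raises an AttributeError. The element_text helper below is a hypothetical, more defensive alternative written for this illustration, not something the minidom module provides.

```python
from xml.dom import Node, minidom


def element_text(element):
    """Return the combined text of an element's direct text nodes ('' if none)."""
    return "".join(
        child.nodeValue
        for child in element.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )


doc = minidom.parseString(
    "<urlset><loc>https://example.com/</loc><loc></loc></urlset>"
)
first, empty = doc.getElementsByTagName("loc")

print(element_text(first))        # prints "https://example.com/"
print(empty.firstChild)           # prints None
print(repr(element_text(empty)))  # prints ''
```

For a sitemap where every loc element contains a URL, the simpler firstChild.nodeValue approach is enough.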
You can see the final source code of this article here at my tutorials repo.
I hope you enjoyed this post and found it useful for parsing XML from a URL in Python. I spent some time reading documentation to get this working myself, so I thought I'd write this post to help anyone else trying to do the same.
Thanks for scrolling.
— Gabriel Romualdo, August 8, 2020