Web Scraping, Part 2 – Parsing

In the last post, we saw that many times the page you request via the web is not the page you end up receiving; in a similar manner, often the page you’ve downloaded looks nothing like the page you end up viewing. When you download a web page, your browser executes scripts and styling rules that can transform the raw content into something entirely different. Consequently, the task of parsing web content from a received HTTP response is often not as straightforward as splitting strings.

A quick demo. Open up a new tab and head to Google. Click in your browser’s URL box, enter the following line, and press Enter.  (Note that some browsers might not let you copy/paste this as-is, so double-check that your pasted version actually includes the “javascript:” part.)
javascript: (function() {document.getElementById('hplogo').srcset = 'https://upload.wikimedia.org/wikipedia/commons/2/24/Yahoo%21_logo.svg';})();

Hopefully this isn’t too surprising, but let’s recap: we downloaded a page from Google, we ran a script, and now we have Google search with a Yahoo logo.  The point is that your browser can and will run all sorts of scripts that change what you see, making the simple task of parsing HTML into one that can be much more complex.

As the browser reads web pages and evaluates styles and scripts, it builds up a hierarchical representation of all the elements on the page known as the Document Object Model (the DOM).  The DOM is dynamic, and holds a lot more detail than what’s actually visible in your browser.  This is important to understand, because when you are scraping a web page, sometimes you need to parse the raw HTML from your request, aka the source; but sometimes, you need to parse the DOM.

By far, the best tools for exploring the source and DOM are your browser’s own developer diagnostics.  Each major browser comes packaged with or offers a developer panel, a way for folks who are building websites to debug them.  Chrome has Developer Tools; Firefox has Toolbox; IE/Edge has F12 Tools; Safari has an Inspector.  For this post (and all future posts on the subject), I’ll be referring to Chrome’s Developer Tools.

In Chrome, open up a new tab, and head again to Google.  Now, right-click on the page (anywhere except the Google logo image) and select View Page Source (also available from the View -> Developer -> View Source menu).  What you’re now looking at is the raw HTML your browser ultimately retrieved when accessing http://www.google.com – there’s a bit of HTML, and a lot of JavaScript, all of which gets executed by your browser.

Search the page for hplogo – you should see one occurrence, as (at the time of this writing) this is the id for the main logo’s image element:

We can see that Google is using an image srcset to determine which image to show, picking, at least in my case, https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png.

Now, head back to the logo image, right-click on it, and select Inspect, which will bring up Chrome’s view of the DOM, highlighting the logo image we just found in the source page.  No surprises here – the detail matches exactly what we saw in the source.

So far, so good.  Now, go ahead and re-run the bit of code from earlier in the post, converting the Google logo into a Yahoo one.  The DOM should now show your new image srcset – the DOM always reflects the current state of the page.  If, however, you view the page source again, it will remain unchanged, as it simply shows the initial content your browser downloaded.

I’m belaboring the point because it’s important: the Python libraries we used in the prior post to retrieve the web page only give us access to the source; they do not construct a document object model, they do not retrieve linked styles or scripts, and they do not execute any JavaScript.  In many cases, the data you want is in the DOM, and in those cases you need to do more work to get at it.
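
To make "source only" concrete, here is a minimal sketch in Python 3 using nothing but the standard library. The HTML snippet, the ImgCollector class, and the image file names are all made up for illustration. The point is that a parser treats the script as inert text and never runs it, so the attribute keeps the value it had in the source.

```python
from html.parser import HTMLParser

# A made-up page: in a browser, the script would swap the logo.
SOURCE = """
<html><body>
  <img id="hplogo" srcset="google.png 1x">
  <script>
    document.getElementById('hplogo').srcset = 'yahoo.svg';
  </script>
</body></html>
"""

class ImgCollector(HTMLParser):
    """Records the srcset of every <img> tag found in the raw source."""

    def __init__(self):
        super().__init__()
        self.srcsets = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            attrs = dict(attrs)
            self.srcsets[attrs.get('id')] = attrs.get('srcset')

collector = ImgCollector()
collector.feed(SOURCE)

# The script was never executed, so we still see the original image.
print(collector.srcsets)
```

A browser loading the same snippet would end up with the Yahoo image in its DOM; the parser above only ever sees what came over the wire.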

We’ll be covering parsing the DOM in future posts; for the remainder of this one, let’s dig a bit into parsing the source, the raw HTML.

An HTML document is really just some text that usually conforms to the HTML standard.  Many web pages loosely conform to the HTML standard, and many others don’t conform at all.  Most browsers are smart enough to deal with poorly written HTML.  Given HTML is just structured text, we could use regular expressions in many cases to extract the content of interest.  For example, even on the shittiest of web pages, links are still contained in anchor elements.
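
As a rough illustration of that claim, the sketch below (Python 3, standard library only) pulls link targets out of some deliberately sloppy markup. The snippet and the pattern are made up for this example, and the pattern assumes quoted href values, so treat it as a starting point rather than a general-purpose link extractor.

```python
import re

# Made-up markup with an unclosed <P>, an unclosed anchor,
# inconsistent casing, and badly nested tags.
sloppy_html = """<P>Unclosed paragraph
<A HREF='/about'>About<br>
<a class=x href="http://example.com/post">A post</a>
<b><i>badly nested</b></i>"""

# Match an opening anchor tag and capture the quoted href value.
href_pattern = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)

hrefs = href_pattern.findall(sloppy_html)
print(hrefs)
```

Even though this HTML would fail validation, the anchors are still recognizable as plain text, which is all a regex needs.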

Let’s again use the exercise of figuring out the titles of the posts on my blog’s landing page.  Python makes regex very easy – the hard part is constructing the right expression.

import urllib2
import re

html = urllib2.urlopen('http://www.craigperler.com').read()

re.findall(r'<a.*www.craigperler.com/blog/2.*>([A-Za-z]+[A-Za-z1-9 ]+)</a>', html)
'''['On Web Scraping',
'Tracking Personal Finances',
'Better Babies',
'I Type Faster Than You',
'Side Projects',
'Plumbing and Coding',
'On Web Scraping',
'Tracking Personal Finances',
'Better Babies',
'I Type Faster Than You']'''

Deconstructing that regex, it says to:

  1. Find any text that starts with “<a”
  2. That is followed by any characters which must be followed by the text “www.craigperler.com/blog/2”
  3. Which is then followed by any characters which must then be followed by “>”
  4. Thereafter, we capture a group, where the captured text must start with one or more letters [A-Za-z]+, followed by one or more letters, digits, or spaces [A-Za-z1-9 ]+ (note that 1-9 leaves out the digit 0; 0-9 would cover every digit)
  5. And the group must be followed by “</a>”
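
The steps above can be tried offline against a few hand-written anchors. This is a sketch in Python 3; the URLs and titles in the sample are made up, and the second, looser pattern is my own variation rather than anything from the original code. It shows how strict the capture group is: a title with a comma in it is silently skipped.

```python
import re

# Made-up anchors, one per line (the dot in ".*" does not cross newlines).
html = '''<a href="http://www.craigperler.com/blog/2016/01/01/side-projects/">Side Projects</a>
<a href="http://www.craigperler.com/blog/2016/02/01/part-1/">Web Scraping, Part 1</a>
<a href="http://www.craigperler.com/about/">About</a>'''

# The pattern from the post: letters, digits 1-9 and spaces only.
strict = r'<a.*www.craigperler.com/blog/2.*>([A-Za-z]+[A-Za-z1-9 ]+)</a>'
strict_titles = re.findall(strict, html)

# A looser variation: capture anything up to the next "<".
loose = r'<a[^>]*www\.craigperler\.com/blog/2[^>]*>([^<]+)</a>'
loose_titles = re.findall(loose, html)

print(strict_titles)
print(loose_titles)
```

The strict pattern finds only the first title, while the looser one also picks up the title containing a comma. Neither matches the About link, since its URL lacks the "/blog/2" piece.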

Regex does the job… in this very simple case. What if we wanted to grab any blog post title on the landing page that’s tagged with “python” or “development”? Figuring out the regex for this would be a tremendous task. Of course, there are better ways for parsing data from HTML or HTML-like content.

There are two main libraries when it comes to HTML parsing: lxml and BeautifulSoup.  There’s a lot of overlap between them, and each can wrap the other: BeautifulSoup can use lxml as its underlying parser, and lxml ships a soupparser module that uses BeautifulSoup.  There are differences in performance and probably in some edge case behaviors, but in general the choice of package is a matter of preference (IMO). With both libraries, there are a couple of ways of accomplishing the same goal. A few are highlighted below.

One approach with lxml is to use XPath to traverse the document. XPath is a language for querying an XML document, and because well-structured HTML closely resembles XML, we can often use XPath to parse HTML. With XPath, we use the HTML tags and attributes to move through the document and pull out the fields of interest. In the following example, we use the XPath string "//a[@itemprop='mainEntityOfPage']/text()".  This means: anywhere in the document, find all anchor tags whose itemprop attribute is exactly “mainEntityOfPage”, and from each of those tags, pull out the contained text.  This XPath relies on the fact that each of the landing page’s blog post links has that itemprop attribute, something we can only know from looking at the page source or inspecting the blog.

import urllib2
response = urllib2.urlopen('http://www.craigperler.com')

# Build a document tree representation of the HTML content:
from lxml import etree
tree = etree.parse(response, etree.HTMLParser())

# Use xpath to find the "nodes" of interest:
tree.xpath("//a[@itemprop='mainEntityOfPage']/text()")
'''['Web Scraping, Part 1',
 'On Web Scraping',
 'Tracking Personal Finances',
 'ProjectSHERPA: a startup retrospective',
 'Better Babies',
 'Can Yelp Reviews Predict Real Estate Prices?',
 'I Type Faster Than You',
 u'Don\u2019t Look at My Baby',
 'Side Projects',
 'Plumbing and Coding']'''

One thing to note here is that we get more posts back than with the regex method.  This is because our regex’s capture group only allowed letters, the digits 1-9, and spaces, so it skipped any title containing punctuation (a comma, colon, or question mark), as well as the “Don’t Look at My Baby” link, which has the Unicode apostrophe (\u2019).  The XPath doesn’t care what characters are in the text – it’s just more forgiving than our very specific regex pattern.
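
If you want to experiment with the same kind of query without installing anything, Python 3's standard library has xml.etree.ElementTree, which understands a limited subset of XPath. This is a swapped-in stand-in for lxml, not the same library: it only accepts well-formed XML and has no text() step, so the text is read from each element instead. The fragment below is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed fragment; real pages are rarely this tidy.
FRAGMENT = """
<div>
  <a itemprop="mainEntityOfPage" href="/blog/1">Side Projects</a>
  <a itemprop="mainEntityOfPage" href="/blog/2">Don\u2019t Look at My Baby</a>
  <a href="/about">About</a>
</div>
""".strip()

root = ET.fromstring(FRAGMENT)

# ".//a[@itemprop='...']" selects matching anchors anywhere under the root.
anchors = root.findall(".//a[@itemprop='mainEntityOfPage']")
titles = [a.text for a in anchors]
print(titles)
```

The attribute filter selects on structure rather than on the characters in the title, so the curly apostrophe comes through untouched.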

Another option with lxml is to use a CSS selector.  CSS defines how to apply styling to HTML content; CSS selectors are a way of identifying the elements to which a given styling rule should apply.  So, just like XPath can find the part of the document that has the content we want, CSS selectors can do the same.  We can use the same itemprop with CSS selectors by querying for 'a[itemprop="mainEntityOfPage"]'.  This selects the same anchor tags as the XPath did earlier, but uses a different syntax.  The code looks like:

import urllib2
response = urllib2.urlopen('http://www.craigperler.com')

# Build a document tree representation of the HTML content:
from lxml import etree
tree = etree.parse(response, etree.HTMLParser()) 

# Use CSS to select the tags of interest:
from lxml.cssselect import CSSSelector
selector = CSSSelector('a[itemprop="mainEntityOfPage"]')

# Print out the text attribute of each selected HTML tag:
[element.text for element in selector(tree)]

BeautifulSoup doesn’t offer an XPath option, but does support a subset of CSS selection in addition to its own API.

import urllib2
response = urllib2.urlopen('http://www.craigperler.com').read()

# Make the soup (wrap the HTML response in a BeautifulSoup instance):
from bs4 import BeautifulSoup
soup = BeautifulSoup(response, 'html.parser')  # Name the parser explicitly; bs4 warns if it has to guess.

# Pull out the links using CSS selection:
links = soup.select('a[itemprop="mainEntityOfPage"]')
[link.text for link in links]

BeautifulSoup also offers its own API that may or may not be more intuitive depending on your tastes:

import urllib2
response = urllib2.urlopen('http://www.craigperler.com').read()

# Make the soup (wrap the HTML response in a BeautifulSoup instance):
from bs4 import BeautifulSoup
soup = BeautifulSoup(response, 'html.parser')  # Name the parser explicitly; bs4 warns if it has to guess.

# Use the soup's find_all call (findAll is the older spelling) to retrieve the links:
links = soup.find_all('a', itemprop='mainEntityOfPage')
[link.text for link in links]

Using XPath, CSS, and regex, it’s possible to identify and extract most elements in a well-structured web page. There are of course cases, especially when authors are specifically attempting to block scraping, where the pages are obfuscated, and more involved approaches are required.

That said, simply knowing how to retrieve and parse web pages, you can get stuff done with web scraping. We can build a very rudimentary web crawler that will analyze a page for links, and then traverse those related pages for more links, until an entire site has been exhaustively searched. This is a recursive or iterative process – as each page is scraped, we need to re-run the scraping process on all of the related links of interest.

The following bit of code will crawl through my entire blog site, starting with the landing page. With each scraped page, it will identify all other links that point to places on my blog, and will build a dictionary that holds link relationships (which page points to which other pages).

import urllib2
from lxml import etree

# The starting URL for the crawler:
start_url = 'http://craigperler.com'

# Given a URL, return a list of all the links that point to other pages on my blog.
# Uses XPath to filter only links on craigperler.com/blog.
def get_links(url):
    try:
        response = urllib2.urlopen(url)
        tree = etree.parse(response, etree.HTMLParser())
        return tree.xpath("//a[contains(@href, 'craigperler.com/blog')]/@href")
    except Exception:
        # Some linked pages on my site such as MIDI files aren't retrievable or parse-able in this
        # way.  Rather than account for those, this simply ignores attempts to pull them down.
        return []

# Run the crawler, starting with the start_url, and using the link_retriever function
# for extracting links from the scraped pages.
def crawl(start_url, link_retriever=get_links):
    links = {} # Holds the dictionary of links to return at the end.
    stack = [] # Despite the name, this list is used as a queue (first in, first out) of pages to crawl next.

    # Start the crawler by retrieving the start_url:
    stack.append(start_url)

    # While there are pages to crawl:
    while len(stack) > 0:
        # Get the next URL to crawl, removing it from the front of the list:
        next_url = stack.pop(0)

        # Add the URL about-to-be scraped to our list of scraped URLs:
        links[next_url] = []

        # Get the links on the URL:
        for link in link_retriever(next_url):
            links[next_url].append(link)

            # Clean each retrieved link - this is just to speed up the process:
            if '#' in link:
                link = link.split('#')[0]
            if '?' in link:
                link = link.split('?')[0]
            if not link.endswith('/'):
                link += '/'

            # Add each link not yet seen to the stack, so that it will in turn be crawled:
            if link not in links:
                stack.append(link)
                links[link] = []

    return links
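
To watch the traversal without touching the network, the same idea can be run against a made-up, in-memory "site". This is a sketch in Python 3, not a drop-in replacement for the code above: FAKE_SITE, normalize, and fake_retriever are hypothetical names invented for the example. It also differs in two small ways that I chose for clarity: it records the cleaned links rather than the raw ones, and it tracks visited pages in a separate set.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit, urlunsplit

# A made-up site: each page maps to the links found on it.
FAKE_SITE = {
    'http://example.com/': ['http://example.com/a/#top',
                            'http://example.com/b?x=1'],
    'http://example.com/a/': ['http://example.com/', 'http://example.com/b'],
    'http://example.com/b/': [],
}

def normalize(link):
    """Drop the fragment and query string, and ensure a trailing slash."""
    link, _fragment = urldefrag(link)
    parts = urlsplit(link)
    path = parts.path if parts.path.endswith('/') else parts.path + '/'
    return urlunsplit((parts.scheme, parts.netloc, path, '', ''))

def fake_retriever(url):
    """Stand-in for get_links: looks the page up instead of downloading it."""
    return FAKE_SITE.get(url, [])

def crawl(start_url, link_retriever):
    links = {}                  # page -> cleaned links found on that page
    seen = {start_url}          # pages already queued or crawled
    queue = deque([start_url])  # pages waiting to be crawled, oldest first
    while queue:
        url = queue.popleft()
        links[url] = []
        for link in link_retriever(url):
            link = normalize(link)
            links[url].append(link)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return links

result = crawl('http://example.com/', fake_retriever)
for page, found in result.items():
    print(page, '->', found)
```

Because the retriever is passed in as an argument, swapping the fake one for a real downloader is a one-line change, which is the same reason the crawler above takes a link_retriever parameter.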
