{"id":1187,"date":"2016-10-30T18:22:06","date_gmt":"2016-10-30T22:22:06","guid":{"rendered":"http:\/\/www.craigperler.com\/blog\/?p=1187"},"modified":"2024-06-06T23:25:25","modified_gmt":"2024-06-07T03:25:25","slug":"web-scraping-part-2-parsing","status":"publish","type":"post","link":"https:\/\/www.craigperler.com\/blog\/2016\/10\/30\/web-scraping-part-2-parsing\/","title":{"rendered":"Web Scraping, Part 2: Parsing"},"content":{"rendered":"<div class=\"wp-block-image\">\n<figure class=\"alignleft\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"890\" height=\"593\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816705ca5bf8.png?resize=890%2C593&#038;ssl=1\" alt=\"\" class=\"wp-image-1215\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816705ca5bf8.png?w=890&amp;ssl=1 890w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816705ca5bf8.png?resize=300%2C200&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816705ca5bf8.png?resize=768%2C512&amp;ssl=1 768w\" sizes=\"auto, (max-width: 890px) 100vw, 890px\" \/><\/figure>\n<\/div>\n\n\n<p id=\"gWDoUSD\"><\/p>\n\n\n\n<p>In the <a href=\"https:\/\/www.craigperler.com\/blog\/2016\/10\/27\/web-scraping-part-1\/\">last post<\/a>, we saw that many times the web page you request is not the page you receive.  Similarly, often the page you&#8217;ve downloaded looks nothing like the page you end up viewing. When you download a web page, your browser executes scripts that transform the raw content into something entirely different. Consequently, the task of parsing web content from a received HTTP response is often not as straightforward as splitting strings.<\/p>\n\n\n\n<h2 id=\"a-quick-demo\" class=\"wp-block-heading\">A Quick Demo<\/h2>\n\n\n\n<p>Open up a new tab and head to <a href=\"https:\/\/www.google.com\/\">Google<\/a>. 
Click in your browser&#8217;s URL box, enter the following line, and press Enter. &nbsp;(Note that some browsers might not let you copy\/paste this as-is, so double-check that your pasted version actually includes the &#8220;javascript:&#8221; part.)<br><code>javascript: (function() {document.getElementById('hplogo').srcset = 'https:\/\/upload.wikimedia.org\/wikipedia\/commons\/2\/24\/Yahoo%21_logo.svg';})();<\/code><\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignleft\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"1139\" height=\"570\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5815090088139.png?resize=1139%2C570&#038;ssl=1\" alt=\"\" class=\"wp-image-1192\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5815090088139.png?w=1139&amp;ssl=1 1139w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5815090088139.png?resize=300%2C150&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5815090088139.png?resize=768%2C384&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5815090088139.png?resize=1024%2C512&amp;ssl=1 1024w\" sizes=\"auto, (max-width: 1139px) 100vw, 1139px\" \/><\/figure>\n<\/div>\n\n\n<p id=\"uxfGLKV\">Hopefully this isn&#8217;t too surprising, but let&#8217;s recap: we downloaded a page from Google,&nbsp;we ran a script, and now we&nbsp;have Google search with a Yahoo logo. 
&nbsp;The point is that your browser can and will run all sorts of scripts that change what you see, turning the seemingly simple task of parsing HTML into one that can be much more complex.<\/p>\n\n\n\n<p>As the browser reads web pages and evaluates styles and scripts, it builds up a hierarchical representation of all the elements on the page known as the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Document_Object_Model\">Document Object Model<\/a> (the DOM). &nbsp;The DOM is&nbsp;dynamic, and holds a lot more detail than what&#8217;s actually visible in your browser. &nbsp;This is important to understand, because when you are scraping a web page, sometimes you need to parse&nbsp;the raw HTML from your request, aka the source; but sometimes, you need to&nbsp;parse the DOM.<\/p>\n\n\n\n<h2 id=\"browser-diagnostics\" class=\"wp-block-heading\">Browser Diagnostics<\/h2>\n\n\n\n<p>By far, the best&nbsp;tools for exploring the source and DOM&nbsp;are your browser&#8217;s own developer&nbsp;diagnostics. &nbsp;Each major browser comes packaged with or offers a developer panel, a way for&nbsp;folks who are building websites to debug them. &nbsp;Chrome has <a href=\"https:\/\/developer.chrome.com\/devtools\">Developer Tools<\/a>; Firefox has <a href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Tools\/Tools_Toolbox\">Toolbox<\/a>; IE\/Edge has <a href=\"https:\/\/developer.microsoft.com\/en-us\/microsoft-edge\/platform\/documentation\/f12-devtools-guide\/\">F12 Tools<\/a>; Safari has an <a href=\"https:\/\/developer.apple.com\/safari\/tools\/\">Inspector<\/a>. &nbsp;For this post (and all future posts on the subject), I&#8217;ll be referring to Chrome&#8217;s Developer Tools.<\/p>\n\n\n\n<p>In Chrome, open up a new tab, and head again to <a href=\"https:\/\/www.google.com\/\">Google<\/a>. \u00a0Now, right-click on the page (anywhere except the Google logo image) and select View Page Source (also available from the View -> Developer -> View Source menu). 
\u00a0You now see the raw HTML your browser retrieves when accessing http:\/\/www.google.com &#8211; there&#8217;s a bit of HTML, and a lot of JavaScript, all of which your browser has executed.<\/p>\n\n\n\n<p>Search the page for&nbsp;<em>hplogo<\/em> &#8211; you should see one occurrence, as (at the time of this writing) this is the id for the main logo&#8217;s image element:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"1160\" height=\"77\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581514118929f.png?resize=1160%2C77&#038;ssl=1\" alt=\"\" class=\"wp-image-1194\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581514118929f.png?w=1358&amp;ssl=1 1358w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581514118929f.png?resize=300%2C20&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581514118929f.png?resize=768%2C51&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581514118929f.png?resize=1024%2C68&amp;ssl=1 1024w\" sizes=\"auto, (max-width: 1160px) 100vw, 1160px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"vRTTLZu\"><span id=\"inspecting-the-source\">Inspecting the Source<\/span><\/h3>\n\n\n\n<p>We can see that Google is using&nbsp;an image <a href=\"https:\/\/css-tricks.com\/responsive-images-youre-just-changing-resolutions-use-srcset\/\">srcset<\/a> to determine which image to show, picking, at least in my case,&nbsp;<a href=\"https:\/\/www.google.com\/images\/branding\/googlelogo\/2x\/googlelogo_color_272x92dp.png\">https:\/\/www.google.com\/images\/branding\/googlelogo\/2x\/googlelogo_color_272x92dp.png<\/a>.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignright\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"754\" 
height=\"350\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581514f22b1ad.png?resize=754%2C350&#038;ssl=1\" alt=\"\" class=\"wp-image-1195\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581514f22b1ad.png?w=754&amp;ssl=1 754w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581514f22b1ad.png?resize=300%2C139&amp;ssl=1 300w\" sizes=\"auto, (max-width: 754px) 100vw, 754px\" \/><\/figure>\n<\/div>\n\n\n<p>&nbsp;Now, head back to the logo image, right-click on it, and select Inspect, which will bring up Chrome&#8217;s view of the DOM, highlighting the logo image we just found in the source page. &nbsp;No surprises here &#8211; the detail&nbsp;matches exactly what we saw in the source.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"1039\" height=\"140\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581515684c921.png?resize=1039%2C140&#038;ssl=1\" alt=\"\" class=\"wp-image-1196\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581515684c921.png?w=1039&amp;ssl=1 1039w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581515684c921.png?resize=300%2C40&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581515684c921.png?resize=768%2C103&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581515684c921.png?resize=1024%2C138&amp;ssl=1 1024w\" sizes=\"auto, (max-width: 1039px) 100vw, 1039px\" \/><\/figure>\n\n\n\n<p id=\"rshVWJs\"><\/p>\n\n\n\n<p>So far, so good. &nbsp;Now, go ahead and re-run the bit of code from earlier in the post, converting the Google logo into a Yahoo one. 
&nbsp;The DOM should now show your new image src &#8211;&nbsp;the DOM always reflects the current state of the page. &nbsp;If, however, you re-view the page source, that will remain unchanged, as that simply shows the initial content your browser downloaded.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"1037\" height=\"115\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_58151612aa96b.png?resize=1037%2C115&#038;ssl=1\" alt=\"\" class=\"wp-image-1197\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_58151612aa96b.png?w=1037&amp;ssl=1 1037w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_58151612aa96b.png?resize=300%2C33&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_58151612aa96b.png?resize=768%2C85&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_58151612aa96b.png?resize=1024%2C114&amp;ssl=1 1024w\" sizes=\"auto, (max-width: 1037px) 100vw, 1037px\" \/><\/figure>\n\n\n\n<p id=\"ApWOKly\"><\/p>\n\n\n\n<p>I&#8217;m belaboring the point because it&#8217;s important: the Python libraries&nbsp;we used in the prior post to&nbsp;retrieve the web page only give us access to the source; they do not construct a document object model, they do not retrieve linked styles or scripts, and they do not execute any JavaScript. 
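<\/p>

<p>To make that concrete, here&#8217;s a tiny sketch (the page snippet and names are made up): a static parse of HTML containing a logo-swapping script &#8211; much like our bookmarklet &#8211; still reports the original image, because the script never runs. This uses Python 3&#8217;s standard library html.parser.<\/p>

```python
from html.parser import HTMLParser

# A made-up page whose inline script would swap the logo in a browser.
# A static parser never executes the script, so it only sees the original src.
page = """
<html><body>
  <img id="hplogo" src="google_logo.png">
  <script>document.getElementById('hplogo').src = 'yahoo_logo.svg';</script>
</body></html>
"""

class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every img tag encountered."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.srcs.extend(value for name, value in attrs if name == "src")

parser = ImgSrcParser()
parser.feed(page)
print(parser.srcs)  # ['google_logo.png'] - the script never ran
```

<p>Swap in whatever markup you like: no matter what the script would do in a browser, the static parser&#8217;s view never changes.<\/p>

<p>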
&nbsp;In many cases, the data you want is in the DOM, and in those cases you need to do more work to get at it.<\/p>\n\n\n\n<p>We&#8217;ll be covering parsing the DOM in future posts; for the remainder of this one, let&#8217;s dig a bit into parsing the source, the raw HTML.<\/p>\n\n\n\n<h2 id=\"parsing-html\" class=\"wp-block-heading\">Parsing HTML<\/h2>\n\n\n\n<p>An HTML document is really just some text that usually conforms to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/HTML\">HTML standard<\/a>. \u00a0Many web pages loosely conform to the HTML standard, and many others don&#8217;t conform at all. \u00a0Most browsers are smart enough to deal with poorly written HTML. \u00a0As HTML is just structured text, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Regular_expression\">regular expressions<\/a> in many cases can extract the content of interest. \u00a0For example, even\u00a0on the shittiest of web pages, <a href=\"http:\/\/www.w3schools.com\/tags\/tag_a.asp\">anchor elements<\/a> still contain links.<\/p>\n\n\n\n<p>Let&#8217;s again use the exercise of figuring out the titles of the posts on my blog&#8217;s landing page. 
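<\/p>

<p>As a quick warm-up before hitting the live page, here&#8217;s a self-contained sketch of the idea against a made-up HTML fragment (the snippet and the non-greedy quantifiers are my own; the live page will differ):<\/p>

```python
import re

# A made-up fragment standing in for the landing page HTML.
html = ('<a href="http://www.craigperler.com/blog/2016/10/27/web-scraping-part-1/">'
        'On Web Scraping</a>'
        '<a href="http://example.com/">Elsewhere</a>')

# Non-greedy quantifiers (.*?) keep each match local to a single anchor tag.
titles = re.findall(r'<a.*?www\.craigperler\.com/blog/2.*?>([A-Za-z]+[A-Za-z1-9 ]+)</a>', html)
print(titles)  # ['On Web Scraping']
```

<p>Only the anchor pointing at a blog post URL survives; the unrelated link is ignored.<\/p>

<p>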
&nbsp;Python makes regex very easy &#8211; the hard part is&nbsp;constructing the right expression.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport urllib2\nimport re\n\nhtml = urllib2.urlopen('http:\/\/www.craigperler.com').read()\n\nre.findall(r'&lt;a.*www.craigperler.com\/blog\/2.*&gt;(&#x5B;A-Za-z]+&#x5B;A-Za-z1-9 ]+)&lt;\/a&gt;', html)\n'''&#x5B;'On Web Scraping',\n'Tracking Personal Finances',\n'Better Babies',\n'I Type Faster Than You',\n'Side Projects',\n'Plumbing and Coding',\n'On Web Scraping',\n'Tracking Personal Finances',\n'Better Babies',\n'I Type Faster Than You']'''\n<\/pre><\/div>\n\n\n<p>Deconstructing that regex, it says to:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Find any text that starts with &#8220;&lt;a&#8221;<\/li><li>That is followed by any characters which must be followed by the text &#8220;www.craigperler.com\/blog\/2&#8221;<\/li><li>Which is then followed by any characters which must then be followed by &#8220;>&#8221;<\/li><li>Thereafter, we capture a group, where the captured text must start with a letter [A-Za-z], and can be followed by letters, digits 1 through 9, or spaces [A-Za-z1-9 ] (note the class omits 0)<\/li><li>And the group must be followed by &#8220;&lt;\/a>&#8221;<\/li><\/ol>\n\n\n\n<p>Regex does the job&#8230; in this very simple case. What if we wanted to grab any blog post title on the landing page that&#8217;s tagged with &#8220;python&#8221; or &#8220;development&#8221;? Figuring out the regex for this would be a tremendous task. Of course, there are better ways to parse data from HTML or HTML-like content.<\/p>\n\n\n\n<h2 id=\"parsing-libraries\" class=\"wp-block-heading\">Parsing Libraries<\/h2>\n\n\n\n<p>There are two main libraries when it comes to HTML parsing: <a href=\"http:\/\/lxml.de\/\">lxml<\/a>\u00a0and\u00a0<a href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\/\">BeautifulSoup<\/a>. 
\u00a0There&#8217;s a lot of overlap across the libraries, and in general, each provides an API that can wrap the other. \u00a0There are differences in <a href=\"http:\/\/romanvm.pythonanywhere.com\/post\/comparison-html5-parsers-gumbo-vs-html5lib-12\/\">performance<\/a> and probably in some <a href=\"http:\/\/stackoverflow.com\/questions\/22696961\/beautifulsoup-lxml-and-html5lib-parsers-scraping-differences\">edge case behaviors<\/a>, but in general the choice of package is a matter of preference (IMO). With all of these libraries, there are a couple of ways of accomplishing the same goal. A few highlights are below.<\/p>\n\n\n\n<h3 id=\"lxml\" class=\"wp-block-heading\">LXML<\/h3>\n\n\n\n<p>One approach with LXML is using XPath to traverse the document. <a href=\"https:\/\/en.wikipedia.org\/wiki\/XPath\">XPath<\/a> is a language for querying an XML document, and as well-structured HTML is nearly well-formed XML, we can often use XPath as a way of parsing HTML. With XPath, we use the HTML tags and attributes to move throughout the document and pull out the fields of interest. In the following example, we use the XPath string,&nbsp;&#8220;<em>\/\/a[@itemprop=&#8217;mainEntityOfPage&#8217;]\/text()<\/em>&#8221;. &nbsp;This means, starting from the root of the document, find all anchor tags that have an itemprop attribute that contains only the value &#8220;mainEntityOfPage&#8221;, and from each of those tags, pull out the contained text. 
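<\/p>

<p>You can try the same query shape offline with nothing but the standard library &#8211; xml.etree.ElementTree supports a small XPath subset that includes attribute predicates. The snippet below is a made-up, well-formed stand-in for the blog markup; real-world pages generally need lxml&#8217;s more forgiving HTML parser.<\/p>

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed stand-in for the blog's landing page markup.
snippet = """
<html><body>
  <a itemprop="mainEntityOfPage" href="/blog/post-1/">On Web Scraping</a>
  <a href="/about/">About</a>
  <a itemprop="mainEntityOfPage" href="/blog/post-2/">Side Projects</a>
</body></html>
"""

root = ET.fromstring(snippet.strip())
# ElementTree's XPath subset has no /text(), so pull .text off each element:
titles = [a.text for a in root.findall(".//a[@itemprop='mainEntityOfPage']")]
print(titles)  # ['On Web Scraping', 'Side Projects']
```

<p>The attribute predicate does the filtering; only the two tagged anchors come back.<\/p>

<p>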
&nbsp;This XPath relies on the fact that each of the landing page blog post links has that itemprop attribute, something we can only know from looking at the page source or inspecting the blog.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"992\" height=\"101\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581576c3cc2dc.png?resize=992%2C101&#038;ssl=1\" alt=\"\" class=\"wp-image-1202\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581576c3cc2dc.png?w=992&amp;ssl=1 992w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581576c3cc2dc.png?resize=300%2C31&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_581576c3cc2dc.png?resize=768%2C78&amp;ssl=1 768w\" sizes=\"auto, (max-width: 992px) 100vw, 992px\" \/><\/figure>\n\n\n\n<p id=\"dZiWyjZ\"><\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; highlight: [17]; title: ; notranslate\" title=\"\">\nimport urllib2\nresponse = urllib2.urlopen('http:\/\/www.craigperler.com')\n\n# Build a document tree representation of the HTML content:\nfrom lxml import etree\ntree = etree.parse(response, etree.HTMLParser())\n\n# Use xpath to find the \"nodes\" of interest:\ntree.xpath(\"\/\/a&#x5B;@itemprop='mainEntityOfPage']\/text()\")\n'''&#x5B;'Web Scraping, Part 1',\n 'On Web Scraping',\n 'Tracking Personal Finances',\n 'ProjectSHERPA: a startup retrospective',\n 'Better Babies',\n 'Can Yelp Reviews Predict Real Estate Prices?',\n 'I Type Faster Than You',\n u'Don\\u2019t Look at My Baby',\n 'Side Projects',\n 'Plumbing and Coding']'''\n<\/pre><\/div>\n\n\n<p>One thing to note here is that we get more posts back than with the regex method. 
&nbsp;This is because our regex pattern was not accounting for&nbsp;unicode characters, so it didn&#8217;t find the &#8220;Don&#8217;t Look at My Baby&#8221; link, which has the unicode apostrophe&nbsp;(<a href=\"http:\/\/www.fileformat.info\/info\/unicode\/char\/2019\/index.htm\">\\u2019<\/a>). &nbsp;The XPath doesn&#8217;t care what characters are in the text &#8211; it&#8217;s just more forgiving than our very specific regex pattern.<\/p>\n\n\n\n<h4 id=\"css-selector\" class=\"wp-block-heading\">CSS Selector<\/h4>\n\n\n\n<p>Another option with LXML is to use a CSS selector. &nbsp;CSS defines how to apply styling to HTML content; <a href=\"https:\/\/en.wikipedia.org\/wiki\/Cascading_Style_Sheets#Selector\">CSS selectors<\/a> are a way of identifying the elements to which a given styling rule should apply. &nbsp;So, just like XPath can find the part of the document that has the content we want, CSS selectors can do the same. &nbsp;We can use the same itemprop with CSS selectors by querying for&nbsp;&#8220;<em>a[itemprop=&#8217;mainEntityOfPage&#8217;]<\/em>&#8220;. &nbsp;This means the same thing as it did earlier, but uses a different syntax. 
&nbsp;The code looks like:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport urllib2\nresponse = urllib2.urlopen('http:\/\/www.craigperler.com')\n\n# Build a document tree representation of the HTML content:\nfrom lxml import etree\ntree = etree.parse(response, etree.HTMLParser()) \n\n# Use CSS to select the tags of interest:\nfrom lxml.cssselect import CSSSelector\nselector = CSSSelector('a&#x5B;itemprop=\"mainEntityOfPage\"]')\n\n# Print out the text attribute of each selected HTML tag:\n&#x5B;element.text for element in selector(tree)]\n<\/pre><\/div>\n\n\n<h3 id=\"beautifulsoup\" class=\"wp-block-heading\">BeautifulSoup<\/h3>\n\n\n\n<p>BeautifulSoup doesn&#8217;t offer an XPath option, but does support a subset of CSS selection in addition to its own API.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport urllib2\nresponse = urllib2.urlopen('http:\/\/www.craigperler.com').read()\n\n# Make the soup (wrap the HTML response in a BeautifulSoup instance):\nfrom bs4 import BeautifulSoup\nsoup = BeautifulSoup(response)\n\n# Pull out the links using CSS selection:\nlinks = soup.select('a&#x5B;itemprop=\"mainEntityOfPage\"]')\n&#x5B;link.text for link in links]\n<\/pre><\/div>\n\n\n<p>It also offers its own API that may or may not be more intuitive depending on your tastes:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport urllib2\nresponse = urllib2.urlopen('http:\/\/www.craigperler.com').read()\n\n# Make the soup (wrap the HTML response in a BeautifulSoup instance):\nfrom bs4 import BeautifulSoup\nsoup = BeautifulSoup(response)\n\n# Use the soup's findAll call to retrieve the links:\nlinks = soup.findAll('a', itemprop='mainEntityOfPage')\n&#x5B;link.text for link in links]\n<\/pre><\/div>\n\n\n<p>Using XPath, CSS, and regex, it&#8217;s 
possible to identify and extract most elements in a well-structured web page. There are of course cases where the authors obfuscate pages, and those require more involved approaches.<\/p>\n\n\n\n<p>That said, with just the ability to retrieve and parse web pages, you can get stuff done with web scraping. We can build a very rudimentary web crawler that will analyze a page for links, and then traverse those related pages for more links, until an entire site has been exhaustively searched. This is a recursive or iterative process &#8211; as each page is scraped, we need to re-run the scraping process on all of the related links of interest.<\/p>\n\n\n\n<h2 id=\"a-complete-parsing-example\" class=\"wp-block-heading\">A Complete Parsing Example<\/h2>\n\n\n\n<p>The following bit of code will crawl through my entire blog site, starting with the landing page. With each scraped page, it will identify all other links that point to places on my blog, and will build a dictionary that holds link relationships (which page points to which other pages).<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nimport urllib2\nfrom lxml import etree\n\n# The starting URL for the crawler:\nstart_url = 'http:\/\/craigperler.com'\n\n# Given a URL, return a list of all the links that point to other pages on my blog.\n# Uses XPath to filter only links on craigperler.com\/blog.\ndef get_links(url):\n    try:\n        response = urllib2.urlopen(url)\n        tree = etree.parse(response, etree.HTMLParser())\n        return tree.xpath(\"\/\/a&#x5B;contains(@href, 'craigperler.com\/blog')]\/@href\")\n    except Exception as e:\n        # Some linked pages on my site such as MIDI files aren't retrievable or parse-able in this\n        # way.  
Rather than account for those, this simply ignores attempts to pull them down.\n        return &#x5B;]\n\n# Run the crawler, starting with the start_url, and using the link_retriever function\n# for extracting links from the scraped pages.\ndef crawl(start_url, link_retriever=get_links):\n    links = {} # Holds the dictionary of links to return at the end.\n    stack = &#x5B;] # A (fake) stack structure to track what pages to crawl next.\n\n    # Start the crawler by retrieving the start_url:\n    stack.append(start_url)\n\n    # While there are pages to crawl:\n    while len(stack) &gt; 0:\n        # Get the next URL to crawl, \"popping\" it from the stack:\n        next_url = stack&#x5B;0]\n        del stack&#x5B;0]\n\n        # Add the URL about-to-be scraped to our list of scraped URLs:\n        links&#x5B;next_url] = &#x5B;]\n\n        # Get the links on the URL:\n        for link in link_retriever(next_url):\n            links&#x5B;next_url].append(link)\n\n            # Clean each retrieved link - this is just to speed up the process:\n            if '#' in link:\n                link = link.split('#')&#x5B;0]\n            if '?' in link:\n                link = link.split('?')&#x5B;0]\n            if not link.endswith('\/'):\n                link += '\/'\n\n            # Add each link not yet seen to the stack, so that it will in turn be crawled:\n            if link not in links:\n                stack.append(link)\n                links&#x5B;link] = &#x5B;]\n\n    return links\n<\/pre><\/div>","protected":false},"excerpt":{"rendered":"<p>In the last post, we saw that many times the web page you request is not the page you receive. 
Similarly, often the page you&#8217;ve downloaded looks nothing like the&hellip;<\/p>\n","protected":false},"author":1,"featured_media":1215,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[64],"tags":[],"powerkit_post_featured":[],"class_list":{"0":"post-1187","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-application-development"},"jetpack_featured_media_url":"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816705ca5bf8.png?fit=890%2C593&ssl=1","jetpack_shortlink":"https:\/\/wp.me\/p1SwZ6-j9","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts\/1187","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/comments?post=1187"}],"version-history":[{"count":4,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts\/1187\/revisions"}],"predecessor-version":[{"id":1579,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts\/1187\/revisions\/1579"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/media\/1215"}],"wp:attachment":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/media?parent=1187"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/categories?post=1187"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/tags?post=1187
"},{"taxonomy":"powerkit_post_featured","embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/powerkit_post_featured?post=1187"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}