{"id":1221,"date":"2016-10-31T17:00:38","date_gmt":"2016-10-31T21:00:38","guid":{"rendered":"http:\/\/www.craigperler.com\/blog\/?p=1221"},"modified":"2024-06-06T23:25:25","modified_gmt":"2024-06-07T03:25:25","slug":"web-scraping-part-3-case-study-1","status":"publish","type":"post","link":"https:\/\/www.craigperler.com\/blog\/2016\/10\/31\/web-scraping-part-3-case-study-1\/","title":{"rendered":"Web Scraping, Part 3: Scraping Case Study"},"content":{"rendered":"<div class=\"wp-block-image\">\n<figure class=\"alignleft\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"340\" height=\"320\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816c1685c33f.png?resize=340%2C320&#038;ssl=1\" alt=\"\" class=\"wp-image-1240\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816c1685c33f.png?w=340&amp;ssl=1 340w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816c1685c33f.png?resize=120%2C113&amp;ssl=1 120w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816c1685c33f.png?resize=90%2C85&amp;ssl=1 90w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816c1685c33f.png?resize=320%2C301&amp;ssl=1 320w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816c1685c33f.png?resize=300%2C282&amp;ssl=1 300w\" sizes=\"auto, (max-width: 340px) 100vw, 340px\" \/><\/figure>\n<\/div>\n\n\n<p id=\"PMqpxNk\">In this post, we&#8217;re going to use what was covered on web scraping in the first two\u00a0posts (<a href=\"https:\/\/www.craigperler.com\/blog\/2016\/10\/27\/web-scraping-part-1\/\">#1<\/a> and <a href=\"https:\/\/www.craigperler.com\/blog\/2016\/10\/30\/web-scraping-part-2-parsing\/\">#2<\/a>) in this series for a web scraping case study: scraping contact details for all of the jewelers certified by the <a 
href=\"https:\/\/www.americangemsociety.org\/en\/\">American Gem Society<\/a>. \u00a0The AGS is an\u00a0organization that &#8220;[helps] protect the jewelry-buying public from fraud and false advertising.&#8221; \u00a0The AGS makes no secret of its member lists &#8211; it wants to promote its jeweler members. \u00a0Consequently, it&#8217;s quite easy to search for AGS members by <a href=\"https:\/\/www.americangemsociety.org\/en\/find-a-jeweler#find-a-jeweler-by-zip\">zip code<\/a>, state, or even <a href=\"http:\/\/www.americangemsociety.org\/en\/findapro\">name<\/a>. \u00a0It&#8217;s not as easy, however, to collate a full list of all members, which is what we&#8217;ll be doing in this exercise.<\/p>\n\n\n\n<p>Now, web scraping AGS may seem like a random place to start. \u00a0It would be, except that the site is easily scrapable and <a href=\"https:\/\/www.upwork.com\/job\/Web-Scrape-Data-Extract-two-websites_~01f511088c2282ae8f\/\">someone on Upwork<\/a> was willing to front up to $250 for this data set. \u00a0Why not try to make a buck while learning this stuff?<\/p>\n\n\n\n<h2 id=\"robots-txt\" class=\"wp-block-heading\">Robots.txt<\/h2>\n\n\n\n<p>Before we dive into it, one quick tangent on ethics. \u00a0As noted in my post <a href=\"https:\/\/www.craigperler.com\/blog\/2016\/10\/24\/on-web-scraping\/\">On Scraping<\/a>,\u00a0web scraping is often against a site&#8217;s terms of use. \u00a0Even if the data to be scraped is not going to be used, analyzed, or sold, the act of scraping alone could violate your implicit contract with the web site. 
\u00a0There are two things we need to check first: the public Terms of Use for the site, and the site&#8217;s <a href=\"http:\/\/www.robotstxt.org\/robotstxt.html\">robots.txt<\/a> file.<\/p>\n\n\n\n<p>Skimming over&nbsp;<a href=\"https:\/\/www.americangemsociety.org\/en\/\">https:\/\/www.americangemsociety.org\/en\/<\/a>, I don&#8217;t see any signs of a Terms of Use. &nbsp;Ordinarily, if a site is providing you some free service (a la Facebook or Twitter), or is e-commerce in any way, there&#8217;s a Terms. &nbsp;Usually, there&#8217;s a link to the Terms in the footer.<\/p>\n\n\n\n<p>AGS does, however, have a <a href=\"https:\/\/www.americangemsociety.org\/robots.txt\">robots.txt<\/a> file, copied here:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"459\" height=\"78\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816ad691cc54.png?resize=459%2C78&#038;ssl=1\" alt=\"\" class=\"wp-image-1222\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816ad691cc54.png?w=459&amp;ssl=1 459w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816ad691cc54.png?resize=300%2C51&amp;ssl=1 300w\" sizes=\"auto, (max-width: 459px) 100vw, 459px\" \/><\/figure>\n\n\n\n<p id=\"FzDkYXP\"><\/p>\n\n\n\n<p>The robots file is a list of rules that the site requests we abide by. &nbsp;The list&#8217;s intended for&nbsp;scrapers and bots &#8211; any bit of code that is programmatically accessing the site. &nbsp;Google, for example, has a bot that crawls the web, indexing everything to make Search work for us. 
&nbsp;The robots file is not enforceable, and so unsurprisingly, any bot that has malicious intent is probably not going to check the robots file to review the requested policy.<\/p>\n\n\n\n<p>The AGS robots file says that any bot or crawler (&#8220;User-agent: *&#8221;) should not programmatically access any URL under the admin path (&#8220;\/admin\/&#8221;). \u00a0In effect, we&#8217;re free to scrape AGS, as long as we stay away from the admin pages. \u00a0Fair enough. \u00a0Note that robots files can be much more complex &#8211; <a href=\"https:\/\/www.linkedin.com\/robots.txt\">LinkedIn&#8217;s<\/a>, for example.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignright\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"859\" height=\"640\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816af266f013.png?resize=859%2C640&#038;ssl=1\" alt=\"\" class=\"wp-image-1223\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816af266f013.png?w=859&amp;ssl=1 859w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816af266f013.png?resize=300%2C224&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816af266f013.png?resize=768%2C572&amp;ssl=1 768w\" sizes=\"auto, (max-width: 859px) 100vw, 859px\" \/><\/figure>\n<\/div>\n\n\n<p><\/p>\n\n\n\n<h2 id=\"web-scraping-case-study-objective\" class=\"wp-block-heading\">Web Scraping Case Study Objective<\/h2>\n\n\n\n<p>We want to extract from AGS the full list of certified jewelers with their contact info and certification status. \u00a0For example, this is the page for <a href=\"https:\/\/www.americangemsociety.org\/en\/arthur-weeks-son-jewelers\">Arthur Weeks &amp; Son Jewelers in NY<\/a>. 
\u00a0Note that there are separate lines for each of the fields we want to extract, and for each of those fields the heading is in bold. \u00a0If you want to find the phone number, it might be as easy as finding the emboldened &#8220;Phone:&#8221; text, and then grabbing the next bit of content. \u00a0Before we jump into parsing this page, however, we need to figure out how to get to this page. \u00a0If we want all of the jewelers, it means we need a list of jewelers or URLs, and as we saw, AGS only provides access to jewelers by zip, state, or search.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignleft\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"845\" height=\"617\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816aff703432.png?resize=845%2C617&#038;ssl=1\" alt=\"\" class=\"wp-image-1224\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816aff703432.png?w=845&amp;ssl=1 845w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816aff703432.png?resize=300%2C219&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816aff703432.png?resize=768%2C561&amp;ssl=1 768w\" sizes=\"auto, (max-width: 845px) 100vw, 845px\" \/><\/figure>\n<\/div>\n\n\n<p id=\"GHIPItd\"><\/p>\n\n\n\n<p id=\"UZQSaAI\">At first blush, search by state seems promising. &nbsp;We could list out each state, and construct a link based on the state name.&nbsp;If we click on NY state, for example, it opens up a link to&nbsp;<a href=\"https:\/\/www.americangemsociety.org\/en\/newyork-jewelers\">https:\/\/www.americangemsociety.org\/en\/<strong>newyork<\/strong>-jewelers<\/a>. &nbsp;Perhaps we can just construct these URLs by listing out every state.<\/p>\n\n\n\n<p>The page for each state presents another problem, as they contain lists of jewelers with some details and some links. 
We would need to figure out how to get from a state&#8217;s page to each individual jeweler&#8217;s page.<\/p>\n\n\n\n<h2 id=\"traversing-the-mental-map\" class=\"wp-block-heading\">Traversing the Mental Map<\/h2>\n\n\n\n<p>What we&#8217;ve just done is build a mental map of a traversal scheme. &nbsp;We started with our endpoint, the individual jeweler pages, and&nbsp;worked our way backwards to a starting point. &nbsp;The map of states has links to summaries for each state, and those summaries contain links to each of the jeweler pages. &nbsp;We may be able to construct URLs for each state&#8217;s page, but would need to do some parsing&nbsp;from there. &nbsp;Let&#8217;s take these pages one at a time now.<\/p>\n\n\n\n<p>Let&#8217;s take a closer look at the state links. &nbsp;While it may be easy enough to type out all the states and format URLs per state, I don&#8217;t want to do that &#8211; it sounds like a lot of work. &nbsp;We should be able to write some code to find all the state links. &nbsp;Certainly, if we just pull all the links on this page, we&#8217;ll get back way more than just those for the state pages &#8211; this would include links to other areas of the site, everything in the footer, and the social media icons. &nbsp;We need to narrow this down. 
&nbsp;Inspecting the NY link, we see this:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"991\" height=\"90\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b276d9871.png?resize=991%2C90&#038;ssl=1\" alt=\"\" class=\"wp-image-1225\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b276d9871.png?w=991&amp;ssl=1 991w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b276d9871.png?resize=300%2C27&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b276d9871.png?resize=768%2C70&amp;ssl=1 768w\" sizes=\"auto, (max-width: 991px) 100vw, 991px\" \/><\/figure>\n\n\n\n<p id=\"BhjufLi\"><\/p>\n\n\n\n<p>The link is actually defined with an <a href=\"http:\/\/www.w3schools.com\/tags\/tag_area.asp\">area<\/a> element, which defines a clickable region inside an image map. &nbsp;Looking over the page, it doesn&#8217;t look like there are any other image maps &#8211; we should be able to filter for just the area links to get the URLs for each state&#8217;s page.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Python 3: urllib.request replaces Python 2's urllib2\nfrom urllib.request import urlopen\nfrom lxml import etree\n\nresponse = urlopen('https:\/\/www.americangemsociety.org\/en\/find-a-jeweler')\ntree = etree.parse(response, etree.HTMLParser())\n\n# Every state on the map is an area element; grab just their href attributes:\ntree.xpath('\/\/area\/@href')\n'''&#x5B;'\/en\/alabama-jewelers',\n '\/en\/alaska-jewelers',\n '\/en\/arizona-jewelers',\n '\/en\/arkansas-jewelers',\n...\n '\/en\/westvirginia-jewelers',\n '\/en\/wisconsin-jewelers',\n '\/en\/wyoming-jewelers']'''\n<\/pre><\/pre>\n\n\n\n<p>There you go. These hrefs are relative, but once we prefix them with the site&#8217;s domain we can programmatically get to each state&#8217;s page. 
Next step, we need to see if there&#8217;s a similar shortcut to get from each state page to each jeweler&#8217;s page.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignleft\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"713\" height=\"410\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b46d69ecb.png?resize=713%2C410&#038;ssl=1\" alt=\"\" class=\"wp-image-1229\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b46d69ecb.png?w=713&amp;ssl=1 713w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b46d69ecb.png?resize=300%2C173&amp;ssl=1 300w\" sizes=\"auto, (max-width: 713px) 100vw, 713px\" \/><\/figure>\n<\/div>\n\n\n<p id=\"gZrDjIK\"><\/p>\n\n\n\n<p>First step is to inspect these links and see if there&#8217;s a key attribute that&#8217;s common to each of them, and not common to anything else on the page. &nbsp;It appears we&#8217;re in luck.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignright\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"643\" height=\"98\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b52a6989a.png?resize=643%2C98&#038;ssl=1\" alt=\"\" class=\"wp-image-1230\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b52a6989a.png?w=643&amp;ssl=1 643w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b52a6989a.png?resize=300%2C46&amp;ssl=1 300w\" sizes=\"auto, (max-width: 643px) 100vw, 643px\" \/><\/figure>\n<\/div>\n\n\n<p id=\"NslCAyS\"><\/p>\n\n\n\n<p>We can see that each jeweler link has the class &#8220;jeweler__link&#8221; making it easy to use XPath or CSS to pick just these links from the page.<\/p>\n\n\n\n<p>XPath: <code>\/\/a[@class=\"jeweler__link\"]\/@href<\/code><br>CSS: 
<code>a.jeweler__link<\/code><\/p>\n\n\n\n<h2 id=\"crawling\" class=\"wp-block-heading\">Crawling<\/h2>\n\n\n\n<p>Now we know how to crawl to the pages we want to hit, and it&#8217;s just a matter of parsing those pages for the data. Let&#8217;s take a closer look at <a href=\"https:\/\/www.americangemsociety.org\/en\/arthur-weeks-son-jewelers\">Arthur Weeks &amp; Son<\/a>.<\/p>\n\n\n\n<p>Our goal is to generate some JSON for each jeweler &#8211; what is the name of the jeweler, what grading do they offer, who are the certified AGS members, what is their address, their phone number, and their web page &#8211; this is basically all of the information available per jeweler. We need to check each of these fields one at a time, inspecting for ways we can query for just the data we want, either using XPath or CSS.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"531\" height=\"62\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b73e8a882.png?resize=531%2C62&#038;ssl=1\" alt=\"\" class=\"wp-image-1231\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b73e8a882.png?w=531&amp;ssl=1 531w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b73e8a882.png?resize=300%2C35&amp;ssl=1 300w\" sizes=\"auto, (max-width: 531px) 100vw, 531px\" \/><\/figure>\n\n\n\n<p id=\"UoyJwpI\"><\/p>\n\n\n\n<p>The&nbsp;jeweler name seems easy enough. 
&nbsp;It&#8217;s the largest thing on the page, and the only text in an h1 tag.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"542\" height=\"72\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b7cc8374c.png?resize=542%2C72&#038;ssl=1\" alt=\"\" class=\"wp-image-1232\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b7cc8374c.png?w=542&amp;ssl=1 542w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b7cc8374c.png?resize=300%2C40&amp;ssl=1 300w\" sizes=\"auto, (max-width: 542px) 100vw, 542px\" \/><\/figure>\n\n\n\n<p id=\"BATeziz\"><\/p>\n\n\n\n<p>The appraiser grading is similarly identifiable. &nbsp;Each grading is emphasized text held in a p tag that has the class &#8220;appraiser__grading&#8221;.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"511\" height=\"104\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b8580924f.png?resize=511%2C104&#038;ssl=1\" alt=\"\" class=\"wp-image-1234\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b8580924f.png?w=511&amp;ssl=1 511w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816b8580924f.png?resize=300%2C61&amp;ssl=1 300w\" sizes=\"auto, (max-width: 511px) 100vw, 511px\" \/><\/figure>\n\n\n\n<p>The list of certified members is a bit more complex. &nbsp;Each member is held in a list item, but that list (ul tag) is not underneath any easily identifiable tag. &nbsp;Instead, the list is &#8220;next&#8221; to a p tag with the class &#8220;appraiser__certified&#8221;. 
&nbsp;In other words, to get to the list of certified members, you need to find the p tag with that class, go to the ul HTML element that follows it on the same level, and then grab the content from its contained li items. &nbsp;Deconstructing this:<\/p>\n\n\n\n<p>First, find the p tag with the &#8220;appraiser__certified&#8221; class:<br><code>\/\/p[@class=\"appraiser__certified\"]<\/code><\/p>\n\n\n\n<p>From there, we need the following ul tag &#8211; the key bit of XPath here is <i>following-sibling<\/i>, which selects the elements that come after the current one at the same level (here, restricted to ul elements):<br><code>.\/following-sibling::ul<\/code><\/p>\n\n\n\n<p>And then we need all of the text contained in the list items within that ul:<br><code>.\/\/li\/text()<\/code><\/p>\n\n\n\n<p>Altogether, that looks like:<br><code>\/\/p[@class=\"appraiser__certified\"]\/following-sibling::ul\/\/li\/text()<\/code><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"403\" height=\"60\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816bbd827f2e.png?resize=403%2C60&#038;ssl=1\" alt=\"\" class=\"wp-image-1237\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816bbd827f2e.png?w=403&amp;ssl=1 403w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816bbd827f2e.png?resize=300%2C45&amp;ssl=1 300w\" sizes=\"auto, (max-width: 403px) 100vw, 403px\" \/><\/figure>\n\n\n\n<p>The address, phone number, and website are all easily accessible. &nbsp;(Note the address is actually held under a tag with class appraiser__hours.)<\/p>\n\n\n\n<h2 id=\"putting-the-scraper-together\" class=\"wp-block-heading\">Putting the Scraper Together<\/h2>\n\n\n\n<p>That&#8217;s it &#8211; we have everything we need to build a web crawler and scrape this data. 
&nbsp;Here&#8217;s an example that simply iterates over all the links we get, traversing to the jeweler pages, and then uses XPath to get the data:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n# Python 3: urllib.request replaces Python 2's urllib2\nfrom urllib.request import urlopen\nfrom lxml import etree\n\ndomain = 'https:\/\/www.americangemsociety.org'\n\n# Utility method for parsing the main page for each of the state links:\ndef get_state_links(url='https:\/\/www.americangemsociety.org\/en\/find-a-jeweler'):\n    response = urlopen(url)\n    tree = etree.parse(response, etree.HTMLParser())\n    return &#x5B;'{}{}'.format(domain, href) for href in tree.xpath('\/\/area\/@href')]\n\n# Utility method for parsing a state page and getting each of the jeweler links:\ndef get_jeweler_links(url):\n    response = urlopen(url)\n    tree = etree.parse(response, etree.HTMLParser())\n    return &#x5B;'{}{}'.format(domain, href) for href in tree.xpath('\/\/a&#x5B;@class=\"jeweler__link\"]\/@href')]\n\n# Given a URL to a jeweler's page, retrieve the HTML response and use XPath to pull out\n# all of the data we want into a dictionary:\ndef parse_jeweler_page(url):\n    response = urlopen(url)\n    tree = etree.parse(response, etree.HTMLParser())\n\n    fields = {\n        'name': '\/\/h1&#x5B;@class=\"page__heading\"]\/text()',\n        'grading': '\/\/p&#x5B;@class=\"appraiser__grading\"]\/strong\/text()',\n        'certified': '\/\/p&#x5B;@class=\"appraiser__certified\"]\/following-sibling::ul\/\/li\/text()',\n        'address': '\/\/p&#x5B;@class=\"appraiser__hours\"]\/text()',\n        'phone': '\/\/p&#x5B;@class=\"appraiser__phone\"]\/text()',\n        'website': '\/\/p&#x5B;@class=\"appraiser__website\"]\/a\/@href'\n    }\n\n    return {k: tree.xpath(v) for k, v in fields.items()}\n\ndef crawl():\n    jeweler_data = &#x5B;]\n\n    # Crawl through each of the state links...\n    for state_link in get_state_links():\n\n        # For each state, 
crawl through each of the jeweler links...\n        for jeweler_link in get_jeweler_links(state_link):\n\n            # And for each jeweler, store the extracted data:\n            jeweler_data.append(parse_jeweler_page(jeweler_link))\n\n    return jeweler_data\n\n'''&#x5B;{'address': &#x5B;' 4355 Montgomery HwySte 2, Dothan, AL 36303-1696'],\n  'certified': &#x5B;'Ronny Lisenby, RJ'],\n  'grading': &#x5B;],\n  'name': &#x5B;\"Bradshaw's Jewelers\"],\n  'phone': &#x5B;' (334) 793-6363'],\n  'website': &#x5B;'http:\/\/www.bradshawjewelers.com\/']},\n {'address': &#x5B;' 333 Fairhope Ave, Fairhope, AL 36532-2317'],\n  'certified': &#x5B;'Michael Brenny, CGA'],\n  'grading': &#x5B;],\n  'name': &#x5B;\"Brenny's Jewelry Co.\"],\n  'phone': &#x5B;' (251) 928-3916'],\n  'website': &#x5B;'http:\/\/brennysjewelry.com\/']},\n...\n]'''\n<\/pre><\/div>\n\n\n<p>Though we have the data here, we&#8217;re not really done yet. First, note that each of the values pulled back by XPath is a list. We don&#8217;t really want that. For example, the name of each jeweler isn&#8217;t a list of stuff, but just a string &#8211; the one name. We can use XPath to pull back the first value for each retrieved field, or we can post-process the extracted results and just grab the first element in each of these lists. Second, it&#8217;s one thing to extract data, but we probably want to persist it somewhere as well &#8211; save it to a file or database, for example. In any event, hopefully this served as a straightforward web scraping case study.<\/p>\n\n\n\n<p>The next step from here is re-visiting this same example, but leveraging <a href=\"https:\/\/scrapy.org\/\">Scrapy<\/a>, the standard when it comes to web scraping libraries in Python. 
We&#8217;ll go through that in the next post.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this post, we&#8217;re going to use what was covered on web scraping in the first two\u00a0posts (#1 and #2) in this series for a web scraping case study: scraping&hellip;<\/p>\n","protected":false},"author":1,"featured_media":1240,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[64],"tags":[],"powerkit_post_featured":[],"class_list":{"0":"post-1221","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-application-development"},"jetpack_featured_media_url":"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/img_5816c1685c33f.png?fit=340%2C320&ssl=1","jetpack_shortlink":"https:\/\/wp.me\/p1SwZ6-jH","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts\/1221","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/comments?post=1221"}],"version-history":[{"count":4,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts\/1221\/revisions"}],"predecessor-version":[{"id":1582,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts\/1221\/revisions\/1582"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/media\/1240"}],"wp:attachment":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/media?parent=1221"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"h
ttps:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/categories?post=1221"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/tags?post=1221"},{"taxonomy":"powerkit_post_featured","embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/powerkit_post_featured?post=1221"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}