{"id":1247,"date":"2016-11-05T10:53:05","date_gmt":"2016-11-05T14:53:05","guid":{"rendered":"http:\/\/www.craigperler.com\/blog\/?p=1247"},"modified":"2024-06-06T23:25:25","modified_gmt":"2024-06-07T03:25:25","slug":"web-scraping-part-4-scrapy","status":"publish","type":"post","link":"https:\/\/www.craigperler.com\/blog\/2016\/11\/05\/web-scraping-part-4-scrapy\/","title":{"rendered":"Web Scraping, Part 4 &#8211; Scrapy"},"content":{"rendered":"<div class=\"wp-block-image\">\n<figure class=\"alignleft\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"322\" height=\"100\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/11\/img_581cbf682b303.png?resize=322%2C100&#038;ssl=1\" alt=\"\" class=\"wp-image-1248\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/11\/img_581cbf682b303.png?w=322&amp;ssl=1 322w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/11\/img_581cbf682b303.png?resize=300%2C93&amp;ssl=1 300w\" sizes=\"auto, (max-width: 322px) 100vw, 322px\" \/><\/figure>\n<\/div>\n\n\n<p>While it&#8217;s all well and good to know how to retrieve pages, parse them, and build a crude web crawler, there&#8217;s no reason to reinvent the wheel given there are many open sources libraries that package this up for you. &nbsp;The most prolific Python library for scraping is <a href=\"https:\/\/scrapy.org\/\">Scrapy<\/a>.<\/p>\n\n\n\n<p>In this post,&nbsp;I&#8217;ll use Scrapy to&nbsp;scrape the <a href=\"https:\/\/www.americangemsociety.org\/en\/\">American Gem Society<\/a>, the same site we scraped in the <a href=\"https:\/\/www.craigperler.com\/blog\/2016\/10\/31\/web-scraping-part-3-case-study-1\/\">last post<\/a>.<\/p>\n\n\n\n<h2 id=\"what-is-scrapy\" class=\"wp-block-heading\">What is Scrapy?<\/h2>\n\n\n\n<p>Scrapy is a modular framework for crawling and parsing the web. 
&nbsp;&#8220;Modular&#8221; here means that each component of Scrapy is not just customizable, but can be re-written on its own and plugged back in. &nbsp;By far, the best way to learn about the library is via the <a href=\"https:\/\/doc.scrapy.org\/en\/latest\/intro\/overview.html\">official documentation<\/a>. &nbsp;(What we cover here is but a fraction of the feature set&nbsp;offered by Scrapy.)<\/p>\n\n\n\n<p>We develop with Scrapy by building &#8220;spiders&#8221; &#8211; bits of code that crawl through the web. &nbsp;A spider&nbsp;defines what we want to scrape and in some cases how we want to scrape it. &nbsp;As spiders traverse the web, they hand back &#8220;items&#8221; &#8211; the data we have&nbsp;extracted. &nbsp;Of course,&nbsp;this is a gross&nbsp;simplification. &nbsp;Scrapy has support&nbsp;for cookie management, downloading different types of files, multi-server clustering, throttling, server proxies, etc&#8230;<\/p>\n\n\n\n<h2 id=\"installing-scrapy\" class=\"wp-block-heading\">Installing Scrapy<\/h2>\n\n\n\n<p>In theory, installing Scrapy is as easy as&nbsp;<code>pip install Scrapy<\/code> &#8211; but there are all sorts of things that can go wrong. &nbsp;I defer to the official <a href=\"https:\/\/doc.scrapy.org\/en\/latest\/intro\/install.html\">install guide<\/a>. &nbsp;Once Scrapy is installed, you should be able to run&nbsp;<em>scrapy<\/em> at your command line. &nbsp;If you run&nbsp;<code>scrapy startproject ags<\/code>, it will create the necessary folder structure and the basic files you&#8217;ll need for a working spider. &nbsp;In this case,&nbsp;<em>ags<\/em>&nbsp;will be the root folder. 
&nbsp;Head on over to ags\/ags\/spiders (<em>startproject<\/em> nests a package of the same name inside the project folder) and create a&nbsp;new file, ags_spider.py.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom scrapy.spiders.crawl import CrawlSpider\n\nclass AmericanGemSpider(CrawlSpider):\n    name = 'ags'\n    allowed_domains = &#x5B;'www.americangemsociety.org']\n    start_urls = &#x5B;'https:\/\/www.americangemsociety.org\/en\/find-a-jeweler']\n<\/pre><\/div>\n\n\n<p>Let&#8217;s break this down. We&#8217;ve defined&nbsp;a new spider, AmericanGemSpider, which extends Scrapy&#8217;s CrawlSpider. The CrawlSpider class provides the basic functionality for crawling web pages &#8211; if you want to crawl the web, use a CrawlSpider. We&#8217;ve given our spider a name,&nbsp;<em>ags<\/em>&nbsp;&#8211; every spider needs a name. &nbsp;We&#8217;ve then told the spider the domains it&#8217;s allowed to crawl, and the page&nbsp;from which to start crawling.<\/p>\n\n\n\n<p>The&nbsp;<em>allowed_domains<\/em> field restricts the domains our spider is allowed to crawl. &nbsp;Many sites have pages on subdomains, so we need to account for that. &nbsp;You might also want a spider that can crawl across different sites entirely.<\/p>\n\n\n\n<p>The&nbsp;<em>start_urls<\/em>&nbsp;list is where our spider will start. &nbsp;We&#8217;re specifying just one page, but this can be a long list of pages. &nbsp;Our first step is&nbsp;pulling out the link for each state; we could just specify those links explicitly here, but it&#8217;s way less typing to give just the main URL.<\/p>\n\n\n\n<h2 id=\"defining-spider-rules\" class=\"wp-block-heading\">Defining Spider Rules<\/h2>\n\n\n\n<p>The next step is to define some rules for how we find links to traverse, and what we do when we find those links. 
&nbsp;Remember, with AGS, we want to find the state-specific link for each state on the <a href=\"https:\/\/www.americangemsociety.org\/en\/find-a-jeweler\">main page<\/a>; on each <a href=\"https:\/\/www.americangemsociety.org\/en\/newyork-jewelers\">state&#8217;s page<\/a>, we want to find links for each <a href=\"https:\/\/www.americangemsociety.org\/en\/arthur-weeks-son-jewelers\">jeweler&#8217;s page<\/a>; and then when we get to the jeweler pages, we want to extract some data. &nbsp;We identified these links with XPath in the last post, and we can re-use the same XPath here:<\/p>\n\n\n\n<p><code>state_link_xpath = '\/\/area'<br>\njeweler_link_xpath = '\/\/a[@class=\"jeweler__link\"]'<\/code><\/p>\n\n\n\n<p>With Scrapy, we need to wrap these XPaths inside &#8220;link extractors,&#8221; which provide the instrumentation Scrapy needs to convert XPath matches into traversable links. &nbsp;Each of those link extractors is then wrapped in a rule, which tells Scrapy what to do with the pages retrieved for each traversed link.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; highlight: [9,10,11,12]; title: ; notranslate\" title=\"\">\nfrom scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor\nfrom scrapy.spiders.crawl import Rule, CrawlSpider\n\nclass AmericanGemSpider(CrawlSpider):\n    name = 'ags'\n    allowed_domains = &#x5B;'www.americangemsociety.org']\n    start_urls = &#x5B;'https:\/\/www.americangemsociety.org\/en\/find-a-jeweler']\n\n    rules = &#x5B;\n        Rule(LxmlLinkExtractor(restrict_xpaths='\/\/area')),\n        Rule(LxmlLinkExtractor(restrict_xpaths='\/\/a&#x5B;@class=\"jeweler__link\"]'), callback='parse_jeweler_page')\n    ]\n<\/pre><\/div>\n\n\n<p>We&#8217;ve assigned a list of Rule objects to our spider&#8217;s&nbsp;<em>rules<\/em> field. &nbsp;Whenever our spider downloads a page, it will use our rules to find links. &nbsp;On the starting page, it will find each of the state links with the first rule. 
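<\/p>\n\n\n\n<p>Conceptually, a link extractor restricted to an XPath just collects the href attributes found under matching nodes. Here is a rough, stdlib-only sketch of that idea &#8211; an illustration only, with hypothetical sample markup, not Scrapy&#8217;s actual LxmlLinkExtractor, which also resolves relative URLs, de-duplicates, and honors allowed_domains:<\/p>\n\n\n

```python
import xml.etree.ElementTree as ET

# A toy stand-in for a downloaded state page (hypothetical markup,
# shaped like the AGS jeweler links).
SAMPLE_STATE_PAGE = """
<html><body>
  <a class="jeweler__link" href="/en/j-c-jewelers">J.C. Jewelers</a>
  <a class="nav__link" href="/en/about">About</a>
  <a class="jeweler__link" href="/en/t-bird-jewels">T-Bird Jewels</a>
</body></html>
"""

def extract_links(html, xpath):
    # Mimic restrict_xpaths: keep only the hrefs on nodes the XPath matches.
    root = ET.fromstring(html)
    return [el.get("href") for el in root.findall(xpath)]

links = extract_links(SAMPLE_STATE_PAGE, ".//a[@class='jeweler__link']")
print(links)  # ['/en/j-c-jewelers', '/en/t-bird-jewels']
```

\n\n\n<p>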
&nbsp;Scrapy will download each of those pages, and then&nbsp;run the rules over those newly downloaded pages. &nbsp;The&nbsp;XPath for the first rule doesn&#8217;t identify any links on those state pages, as expected; the XPath for the second rule does &#8211; it finds all of the jeweler-specific links.<\/p>\n\n\n\n<p>With this second rule, we&#8217;ve specified a&nbsp;<em>callback<\/em> function. &nbsp;When Scrapy downloads the pages for each of the jeweler-specific links, it will hand the response to a new function we&#8217;ve defined, <em>parse_jeweler_page<\/em>, where we can analyze the HTML&nbsp;and extract the data. &nbsp;If we didn&#8217;t specify a callback here, then the default behavior would be to use the rules to find links on the page, and continue traversing. &nbsp;Neither of our rules would find any links on the jeweler pages, so that would effectively terminate the crawl. &nbsp;Given we&#8217;ve specified a callback, we get access to the HTML responses, and can use XPath to extract the data.<\/p>\n\n\n\n<h2 id=\"parsing-the-page\" class=\"wp-block-heading\">Parsing the Page<\/h2>\n\n\n\n<p>Here is the&nbsp;<em>parse_jeweler_page<\/em> method where we convert the HTML into a dictionary of extracted data:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\n    def parse_jeweler_page(self, response):\n        yield {\n            f: response.xpath(s).extract_first() for f, s in {\n                'name': '\/\/h1&#x5B;@class=\"page__heading\"]\/text()',\n                'grading': '\/\/p&#x5B;@class=\"appraiser__grading\"]\/strong\/text()',\n                'certified': '\/\/p&#x5B;@class=\"appraiser__certified\"]\/following-sibling::ul\/\/li\/text()',\n                'address': '\/\/p&#x5B;@class=\"appraiser__hours\"]\/text()',\n                'phone': '\/\/p&#x5B;@class=\"appraiser__phone\"]\/text()',\n                'website': 
'\/\/p&#x5B;@class=\"appraiser__website\"]\/a\/@href'\n            }.items()\n        }\n<\/pre><\/div>\n\n\n<p>Rather than returning data, this method uses&nbsp;<em>yield<\/em> to hand execution back to the caller. &nbsp;In this example, we are returning one object per response, so there&#8217;s no real distinction between using&nbsp;<em>yield<\/em> versus&nbsp;<em>return<\/em>. &nbsp;In other cases, when there might be multiple objects returned, or if the callback is both&nbsp;extracting data and generating new links to crawl,&nbsp;<em>yield<\/em> makes more sense &#8211; it allows the Scrapy engine to work on the returned information while the spider continues executing.<\/p>\n\n\n\n<p>Here we are yielding a dictionary comprehension. &nbsp;The&nbsp;dictionary returned is&nbsp;a transformation of the dictionary in lines 4-9, which is made up of field names and the corresponding XPath needed to extract that information. &nbsp;The parameter passed to our callback is a wrapped response that exposes an&nbsp;<em>xpath<\/em> method, so we can simply call that on each of our XPaths and get the data we want. &nbsp;This method returns data that looks like:<\/p>\n\n\n\n<p><code>{'website': u'http:\/\/www.jcjewelers.com', 'name': u'J.C. 
Jewelers', 'grading': u'Offers AGS Labs Diamond Grading Reports', 'phone': u' (307) 733-5933', 'address': u' 132 N Cache St, Jackson, WY 83001-8681', 'certified': u'Jan Case, CGA'}<\/code><\/p>\n\n\n\n<h2 id=\"the-complete-example\" class=\"wp-block-heading\">The Complete Example<\/h2>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor\nfrom scrapy.spiders.crawl import Rule, CrawlSpider\n\nclass AmericanGemSpider(CrawlSpider):\n    name = 'ags'\n    allowed_domains = &#x5B;'www.americangemsociety.org']\n    start_urls = &#x5B;'https:\/\/www.americangemsociety.org\/en\/find-a-jeweler']\n\n    rules = &#x5B;\n        Rule(LxmlLinkExtractor(restrict_xpaths='\/\/area')),\n        Rule(LxmlLinkExtractor(restrict_xpaths='\/\/a&#x5B;@class=\"jeweler__link\"]'), callback='parse_jeweler_page')\n    ]\n\n    def parse_jeweler_page(self, response):\n        yield {\n            f: response.xpath(s).extract_first() for f, s in {\n                'name': '\/\/h1&#x5B;@class=\"page__heading\"]\/text()',\n                'grading': '\/\/p&#x5B;@class=\"appraiser__grading\"]\/strong\/text()',\n                'certified': '\/\/p&#x5B;@class=\"appraiser__certified\"]\/following-sibling::ul\/\/li\/text()',\n                'address': '\/\/p&#x5B;@class=\"appraiser__hours\"]\/text()',\n                'phone': '\/\/p&#x5B;@class=\"appraiser__phone\"]\/text()',\n                'website': '\/\/p&#x5B;@class=\"appraiser__website\"]\/a\/@href'\n            }.items()\n        }\n<\/pre><\/div>\n\n\n<p>That&#8217;s a pretty small amount of code to extract a whole lot of data. 
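<\/p>\n\n\n\n<p>Because the callback is just a method that takes a response, the comprehension at its heart can be sanity-checked without running a crawl by stubbing the response object. FakeSelector and FakeResponse below are hypothetical stand-ins, not Scrapy classes; they only mimic the xpath(...).extract_first() call chain the callback relies on:<\/p>\n\n\n

```python
class FakeSelector:
    # Stands in for the selector list returned by response.xpath().
    def __init__(self, value):
        self._value = value

    def extract_first(self):
        return self._value

class FakeResponse:
    # Hypothetical stub: maps each XPath query straight to a canned value.
    def __init__(self, canned):
        self._canned = canned

    def xpath(self, query):
        return FakeSelector(self._canned.get(query))

# Two of the field/XPath pairs from parse_jeweler_page.
FIELDS = {
    'name': '//h1[@class="page__heading"]/text()',
    'phone': '//p[@class="appraiser__phone"]/text()',
}

response = FakeResponse({
    '//h1[@class="page__heading"]/text()': 'J.C. Jewelers',
    '//p[@class="appraiser__phone"]/text()': '(307) 733-5933',
})

# The same dictionary comprehension the spider yields.
item = {f: response.xpath(s).extract_first() for f, s in FIELDS.items()}
print(item)  # {'name': 'J.C. Jewelers', 'phone': '(307) 733-5933'}
```

\n\n\n<p>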
&nbsp;You can run the spider by executing <code>scrapy crawl ags<\/code> &#8211; here&#8217;s sample output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n(scraper)Craigs-MBP:scraper craigperler$ scrapy crawl americangem2\n2016-11-04 14:10:09 &#x5B;scrapy] INFO: Scrapy 1.0.5 started (bot: boot)\n2016-11-04 14:10:09 &#x5B;scrapy] INFO: Optional features available: ssl, http11\n2016-11-04 14:10:09 &#x5B;scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'boot.spiders', 'CONCURRENT_REQUESTS_PER_DOMAIN': 1, 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': &#x5B;'thuzio.spider', 'crunchbase.spider', 'leafly.spider', 'shopify.spider', 'florida_bar.spider', 'indiegogo.spider', 'uktariff.spider', 'giaalumni.spider', 'americangem.spider', 'australiantpb.spider', 'walmart.spider', 'aidn.spider', 'reddit.spider', 'capropertysearch.spider', 'pokemongomap.spider', 'selfstoragefinders.spider', 'yelp.spider', 'catholicdirectory.spider', 'americangem.flat_spider'], 'BOT_NAME': 'boot', 'USER_AGENT': 'Mozilla\/5.0 (X11; Linux x86_64) AppleWebKit\/53 (KHTML, like Gecko) Chrome\/15.0.87', 'DOWNLOAD_DELAY': 0.5}\n2016-11-04 14:10:09 &#x5B;scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState\n2016-11-04 14:10:09 &#x5B;scrapy] INFO: Enabled downloader middlewares: CrackDistilMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats\n2016-11-04 14:10:09 &#x5B;scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware\n2016-11-04 14:10:09 &#x5B;scrapy] INFO: Enabled item pipelines:\n2016-11-04 14:10:09 &#x5B;scrapy] INFO: Spider opened\n2016-11-04 14:10:09 &#x5B;scrapy] INFO: Crawled 0 pages (at 0 pages\/min), scraped 0 
items (at 0 items\/min)\n2016-11-04 14:10:09 &#x5B;scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023\n2016-11-04 14:10:10 &#x5B;scrapy] DEBUG: Crawled (200) &amp;lt;GET https:\/\/www.americangemsociety.org\/en\/find-a-jeweler&amp;gt; (referer: None)\n2016-11-04 14:10:12 &#x5B;scrapy] DEBUG: Crawled (200) &amp;lt;GET https:\/\/www.americangemsociety.org\/en\/nevada-jewelers&amp;gt; (referer: https:\/\/www.americangemsociety.org\/en\/find-a-jeweler)\n2016-11-04 14:10:14 &#x5B;scrapy] DEBUG: Crawled (200) &amp;lt;GET https:\/\/www.americangemsociety.org\/en\/wyoming-jewelers&amp;gt; (referer: https:\/\/www.americangemsociety.org\/en\/find-a-jeweler)\n2016-11-04 14:10:15 &#x5B;scrapy] DEBUG: Crawled (200) &amp;lt;GET https:\/\/www.americangemsociety.org\/en\/500194&amp;gt; (referer: https:\/\/www.americangemsociety.org\/en\/nevada-jewelers)\n2016-11-04 14:10:15 &#x5B;scrapy] DEBUG: Scraped from &amp;lt;200 https:\/\/www.americangemsociety.org\/en\/500194&amp;gt;\n{'website': u'http:\/\/www.tbirdjewels.com', 'name': u'T-Bird Jewels', 'grading': u'Offers AGS Labs Diamond Grading Reports', 'phone': u' (702) 256-3900', 'address': u' 1990 Village Center CirSte P6, Las Vegas, NV 89134-6242', 'certified': u'Jenny O Calleri, CGA'}\n2016-11-04 14:10:19 &#x5B;scrapy] DEBUG: Crawled (200) &amp;lt;GET https:\/\/www.americangemsociety.org\/en\/j-c-jewelers&amp;gt; (referer: https:\/\/www.americangemsociety.org\/en\/wyoming-jewelers)\n2016-11-04 14:10:19 &#x5B;scrapy] DEBUG: Scraped from &amp;lt;200 https:\/\/www.americangemsociety.org\/en\/j-c-jewelers&amp;gt;\n{'website': u'http:\/\/www.jcjewelers.com', 'name': u'J.C. 
Jewelers', 'grading': u'Offers AGS Labs Diamond Grading Reports', 'phone': u' (307) 733-5933', 'address': u' 132 N Cache St, Jackson, WY 83001-8681', 'certified': u'Jan Case, CGA'}\n...\n<\/pre><\/div>\n\n\n<p>Once the spider is running, we can see it crawl the start_url and begin traversing the state-specific pages; as it finds the jeweler links, it grabs the responses, parses them, and spits out the data. &nbsp;In this example, the data goes nowhere; in practice, we&nbsp;could use Scrapy&#8217;s <a href=\"https:\/\/doc.scrapy.org\/en\/latest\/topics\/item-pipeline.html\">item pipeline<\/a> to persist the objects, or do whatever we want with them.<\/p>\n\n\n\n<p>There&#8217;s no reason not to use Scrapy. &nbsp;Whether you&#8217;re working on a simple project (such as scraping AGS) or something&nbsp;massive, Scrapy provides a set of tools that you don&#8217;t need to write yourself. &nbsp;If you need something very customized or specific, you can still start with Scrapy and replace just the components that need it.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While it&#8217;s all well and good to know how to retrieve pages, parse them, and build a crude web crawler, there&#8217;s no reason to reinvent the wheel given there 
are&hellip;<\/p>\n","protected":false},"author":1,"featured_media":1589,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[64],"tags":[],"powerkit_post_featured":[],"class_list":{"0":"post-1247","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-application-development"},"jetpack_featured_media_url":"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/11\/scrapy.webp?fit=300%2C300&ssl=1","jetpack_shortlink":"https:\/\/wp.me\/p1SwZ6-k7","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts\/1247","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/comments?post=1247"}],"version-history":[{"count":5,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts\/1247\/revisions"}],"predecessor-version":[{"id":1590,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts\/1247\/revisions\/1590"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/media\/1589"}],"wp:attachment":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/media?parent=1247"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/categories?post=1247"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/tags?post=1247"},{"taxonomy":"powerkit_post_featured","embeddable":true,"href":"https:\/\/www
.craigperler.com\/blog\/wp-json\/wp\/v2\/powerkit_post_featured?post=1247"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}