{"id":1147,"date":"2016-10-24T19:00:13","date_gmt":"2016-10-24T23:00:13","guid":{"rendered":"http:\/\/www.craigperler.com\/blog\/?p=1147"},"modified":"2024-06-06T23:25:56","modified_gmt":"2024-06-07T03:25:56","slug":"on-web-scraping","status":"publish","type":"post","link":"https:\/\/www.craigperler.com\/blog\/2016\/10\/24\/on-web-scraping\/","title":{"rendered":"On Web Scraping"},"content":{"rendered":"<div class=\"wp-block-image\">\n<figure class=\"alignleft\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"657\" height=\"440\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/wall-1.jpg?resize=657%2C440&#038;ssl=1\" alt=\"SONY DSC\" class=\"wp-image-1153\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/wall-1.jpg?w=657&amp;ssl=1 657w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/wall-1.jpg?resize=120%2C80&amp;ssl=1 120w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/wall-1.jpg?resize=90%2C60&amp;ssl=1 90w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/wall-1.jpg?resize=320%2C214&amp;ssl=1 320w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/wall-1.jpg?resize=560%2C375&amp;ssl=1 560w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/wall-1.jpg?resize=300%2C201&amp;ssl=1 300w\" sizes=\"auto, (max-width: 657px) 100vw, 657px\" \/><\/figure>\n<\/div>\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p><strong>Web scraping is the art of&nbsp;programmatically extracting content from the web.<\/strong><\/p><\/blockquote>\n\n\n\n<p>For those entirely unfamiliar with the concept of web scraping, consider this example. \u00a0It was theorized that doctors who received payments from the medical industry prescribed differently than those who did not. 
\u00a0If your doctor is recommending some random drug, it&#8217;d be good to know <a href=\"http:\/\/www.thehealthyhomeeconomist.com\/is-your-doctor-getting-drug-kickbacks\/\">whether doc is being paid off by Big Pharma<\/a>, right? \u00a0Well,\u00a0as required by\u00a0the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Physician_Payments_Sunshine_Act\">2010 Physician Payment Sunshine Act<\/a>, pharmaceutical companies must file\u00a0when they pay doctors for promotions, research, and other reasons. \u00a0If you consider that there are 680,000+\u00a0doctors and 1500+ drug companies, the number of possible filings\u00a0is massive.<\/p>\n\n\n\n<p>As it so happens, from August 2013 to December 2014, doctors (and teaching hospitals) were paid 14,837,291 times, representing a collective amount of <span style=\"color: #339966;\"><strong>$3.49B<\/strong><\/span>. &nbsp;Each of those payments was filed separately, and each record is made accessible by the&nbsp;<a href=\"https:\/\/openpaymentsdata.cms.gov\/search\">Centers for Medicare &amp; Medicaid Services<\/a>. &nbsp;You can search for this stuff pretty easily, one doctor or drug company at a time.<\/p>\n\n\n\n<p>There is indeed strong proof that doctors who receive payouts from drug companies tend to prescribe differently. 
&nbsp;By programmatically downloading and cataloging each of those 14.8 million records &#8211; <em>by <span style=\"text-decoration: underline;\">scraping<\/span> the data set<\/em> &#8211; the news organization ProPublica was able to see the big picture, find the trends, and make this information <a href=\"https:\/\/projects.propublica.org\/docdollars\/\">transparent to the public<\/a>.<\/p>\n\n\n\n<h2 id=\"many-many-use-cases\" class=\"wp-block-heading\">Many, Many Use Cases!<\/h2>\n\n\n\n<p>There are a lot of articles (<a href=\"https:\/\/onlinejournalismblog.com\/2012\/08\/09\/two-reasons-why-every-journalist-should-know-about-scraping-cross-posted\/\">e.g.<\/a>) about how scraping is a good tool for journalists, as it was in the example above. &nbsp;There are countless other uses for scraping; to name a few:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Scraping job posts to build a job aggregator or <a href=\"https:\/\/www.craigperler.com\/blog\/2016\/10\/21\/projectsherpa-a-startup-retrospective\/\">search board<\/a>.<\/li><li>Scraping news or social media&nbsp;to generate insight and signals for stock trading ideas.<\/li><li>Scraping competitors&#8217; prices to show as&nbsp;comparative pricing on your own site.<\/li><li>Scraping feedback from comments on sites like Yelp to generate targeted leads.<\/li><\/ul>\n\n\n\n<p>It&#8217;s a pretty open-concept tool, and there are many more ideas on this <a href=\"https:\/\/www.quora.com\/What-are-examples-of-how-real-businesses-use-web-scraping\">Quora page<\/a>.<\/p>\n\n\n\n<h2 id=\"ethical-implications-of-web-scraping\" class=\"wp-block-heading\">Ethical Implications of Web Scraping<\/h2>\n\n\n\n<p>That said, scraping data has a few\u00a0legal and ethical quirks. \u00a0First, many sites explicitly prohibit scraping. 
\u00a0For example, from <a href=\"https:\/\/www.craigslist.org\/about\/terms.of.use.en\">Craigslist&#8217;s terms of use<\/a>:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"149\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/cl_terms.png?resize=1024%2C149&#038;ssl=1\" alt=\"cl_terms\" class=\"wp-image-1149\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/cl_terms.png?resize=1024%2C149&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/cl_terms.png?resize=300%2C44&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/cl_terms.png?resize=768%2C112&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/cl_terms.png?resize=1440%2C209&amp;ssl=1 1440w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/cl_terms.png?w=1596&amp;ssl=1 1596w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<p>Specifically&#8230;<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>&#8220;&#8230;[no]&nbsp;robots, spiders, scripts, scrapers, crawlers, etc&#8230;&#8221;<\/p><\/blockquote>\n\n\n\n<p>&#8230;means Craigslist does not want you programmatically accessing its data. &nbsp;In fact, Craigslist <a href=\"http:\/\/3taps.com\/the-craigslist-lawsuit.php\">has taken legal action<\/a> (and won) against companies that have violated these terms. &nbsp;Companies spend a lot of resources building up their datasets; in many cases, perhaps most, access to that data is provided with the expectation that people won&#8217;t take advantage of it. &nbsp;The value of the Craigslist experience hinges entirely on its listings &#8211; its dataset. 
&nbsp;If one were to copy that data and provide it elsewhere,&nbsp;the value of the Craigslist experience would be diminished (to some degree).<\/p>\n\n\n\n<p>Frankly,&nbsp;web scraping falls in legally ambiguous territory. &nbsp;Breaking terms of use might get you banned from a website, but it&#8217;s&nbsp;not necessarily illegal. &nbsp;There may be some copyright caveats to consider, but, in general, <a href=\"http:\/\/lifehacker.com\/5901773\/breaking-a-terms-of-service-isnt-necessarily-a-crime\">breaking terms of use is not a criminal act<\/a>. &nbsp;On the other hand, scraping presents an ethical and moral dilemma. &nbsp;Is it fair to benefit from someone else&#8217;s hard work? &nbsp;LinkedIn has spent years building an infrastructure, user base, and brand that we&#8217;ve come to know as a credible way to store our professional resumes and networks. &nbsp;If one were to&nbsp;scrape every LinkedIn profile that&#8217;s&nbsp;publicly accessible, would that be equivalent to stealing? &nbsp;What if you only copied every 10th profile? &nbsp;Is that really any better?<\/p>\n\n\n\n<h2 id=\"web-scraping-and-search-engines\" class=\"wp-block-heading\">Web Scraping and Search Engines<\/h2>\n\n\n\n<p>Any search engine, such as The Google, works by crawling and\u00a0scraping\u00a0the web, analyzing which pages link to which other pages, and building a giant index\u00a0representing those relationships. \u00a0Of course, there&#8217;s all the intelligence layered on top of that, but at the lowest levels, Google needs to scrape the public web to know what is searchable. \u00a0Google provides a number of methods by which a website owner can opt out of the Google crawler and search process, but the model is opt-out rather than opt-in: the onus is on the website owner, not Google. 
\u00a0If Google is able to profit from web scraping, it can&#8217;t be all that bad, right?<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignright\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"657\" height=\"439\" src=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/puzzle-2-1197936.jpg?resize=657%2C439&#038;ssl=1\" alt=\"puzzle-2-1197936\" class=\"wp-image-1160\" srcset=\"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/puzzle-2-1197936.jpg?w=657&amp;ssl=1 657w, https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/puzzle-2-1197936.jpg?resize=300%2C200&amp;ssl=1 300w\" sizes=\"auto, (max-width: 657px) 100vw, 657px\" \/><\/figure>\n<\/div>\n\n\n<p>Web scraping is an ethical grey area. &nbsp;Some opinions draw a line between personal and commercial use of the scraped data; others disagree with even that. &nbsp;If a company is employing significant counter-scraping technologies, would you expect they&#8217;re ok with someone harvesting their data even for personal use?<\/p>\n\n\n\n<h2 id=\"web-scraping-as-an-intellectual-puzzle\" class=\"wp-block-heading\">Web Scraping as an Intellectual Puzzle<\/h2>\n\n\n\n<p>Whatever your opinion, I view web scraping as something of a puzzle &#8211; an intellectual challenge. &nbsp;Excavating data requires a mix of programming, pattern recognition, and psychology. &nbsp;To scrape a website,&nbsp;you need to understand how the data and flow of information are organized; you need to mentally reverse engineer the structure. &nbsp;The more complex methods require experimenting &#8211; poke the box and see what happens, so to speak. 
&nbsp;Web scraping is a nerdy treasure hunt: even if you don&#8217;t care about the data itself, the experience alone is often interesting.<\/p>\n\n\n\n<p>So, I&#8217;m considering putting up some tutorials about how to approach web scraping, starting from the basics, and scaling up to&nbsp;the very difficult&nbsp;(perhaps such as&nbsp;<a href=\"https:\/\/www.google.com\/recaptcha\/intro\/index.html\">Google&#8217;s CAPTCHAs<\/a>&nbsp;and the <a href=\"http:\/\/www.distilnetworks.com\/\">Distil Networks&#8217; bot detection<\/a>). &nbsp;Ethics aside, scraping, just getting access to the data, can make for a pretty interesting case study in engineering.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Web scraping is the art of&nbsp;programmatically extracting content from the web. For those entirely unfamiliar with the concept of web scraping, consider this example. \u00a0It was theorized that doctors who&hellip;<\/p>\n","protected":false},"author":1,"featured_media":1153,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[64],"tags":[],"powerkit_post_featured":[],"class_list":{"0":"post-1147","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-application-development"},"jetpack_featured_media_url":"https:\/\/i0.wp.com\/www.craigperler.com\/blog\/wp-content\/uploads\/2016\/10\/wall-1.jpg?fit=657%2C440&ssl=1","jetpack_shortlink":"https:\/\/wp.me\/p1SwZ6-iv","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts\/1147","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.craigperler.com\/blog
\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/comments?post=1147"}],"version-history":[{"count":4,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts\/1147\/revisions"}],"predecessor-version":[{"id":1554,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/posts\/1147\/revisions\/1554"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/media\/1153"}],"wp:attachment":[{"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/media?parent=1147"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/categories?post=1147"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/tags?post=1147"},{"taxonomy":"powerkit_post_featured","embeddable":true,"href":"https:\/\/www.craigperler.com\/blog\/wp-json\/wp\/v2\/powerkit_post_featured?post=1147"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}