On Web Scraping

Web scraping is the art of programmatically extracting content from the web.

For those entirely unfamiliar with the concept, consider this example.  It was theorized that doctors who received payments from the medical industry prescribed differently than those who did not.  If your doctor is recommending some random drug, it’d be good to know whether doc is being paid off by Big Pharma, right?  Well, as required by the 2010 Physician Payment Sunshine Act, pharmaceutical companies must file a disclosure whenever they pay doctors for promotions, research, or other reasons.  If you consider that there are 680,000+ doctors and 1,500+ drug companies, the number of possible filings is massive.

As it so happens, from August 2013 to December 2014, doctors (and teaching hospitals) were paid 14,837,291 times, representing a collective amount of $3.49B.  Each of those payments was filed separately, and each record is made accessible by the Centers for Medicare & Medicaid Services.  You can search for this stuff pretty easily, one doctor or drug company at a time.

There is indeed strong evidence that doctors who receive payouts from drug companies tend to prescribe differently.  By programmatically downloading and cataloging each of those 14.8 million records – by scraping the data set – the nonprofit newsroom ProPublica was able to see the big picture, find the trends, and make this information transparent to the public.

There are a lot of articles about how scraping is a good tool for journalists, as it was in the example above.  There are countless other uses for scraping; to name a few:

  • Scraping job posts to build a job aggregator or search board.
  • Scraping news or social media to generate insight and signals for stock trading ideas.
  • Scraping competitors’ prices to show as comparative pricing on your own site.
  • Scraping feedback from comments on sites like Yelp to generate targeted leads.

It’s a pretty open-concept tool, and there are many more ideas on this Quora page.

That said, scraping data has a few legal and ethical quirks.  First, many sites explicitly prohibit scraping.  For example, from Craigslist’s terms of use:

Specifically…

“…[no] robots, spiders, scripts, scrapers, crawlers, etc…”

…means Craigslist does not want you programmatically accessing its data.  In fact, Craigslist has taken legal action (and won) against companies that have violated these terms.  The fact is, companies spend a lot of resources building up their datasets; in many cases, perhaps most, access to that data is provided on the expectation that people won’t take advantage of it.  The value of the Craigslist experience hinges entirely on its listings – its dataset.  If one were to copy that data and provide it elsewhere, the value of the Craigslist experience would be diminished (to some degree).

Frankly, web scraping falls in legally ambiguous territory.  Breaking terms of use might get you banned from a website, but it’s not necessarily illegal.  There may be some copyright caveats to consider, but, in general, breaking terms of use is not a criminal act.  On the other hand, scraping presents an ethical and moral dilemma.  Is it fair to benefit from someone else’s hard work?  LinkedIn has spent years building an infrastructure, user base, and brand that we’ve come to know as a credible place to store our professional resumes and networks.  If one were to scrape every LinkedIn profile that’s publicly accessible, would that be equivalent to stealing?  What if you only copied every 10th profile?  Is that really any better?

Any search engine, such as The Google, works by crawling and scraping the web, analyzing which pages link to which other pages, and building a giant index representing those relationships.  Of course, there’s all the intelligence layered on top of that, but at the lowest levels, Google needs to scrape the public web to know what is searchable.  Google provides a number of methods by which a website owner can opt out of the Google crawler and search process, but the onus is on the website owner to opt out; Google does not wait for anyone to opt in.  If Google is able to profit from web scraping, it can’t be all that bad, right?
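
One widely used opt-out method is a robots.txt file at the root of a site, which lists the paths crawlers are asked to stay away from.  As a rough sketch of how a well-behaved scraper might honor it, the Python below uses the standard library’s urllib.robotparser.  The robots.txt content, the example.com URLs, and the “MyScraper” agent name are all made up for illustration, and none of this describes how Google’s own crawler is built.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
# It lets Googlebot in everywhere except /private/ and asks
# every other crawler to stay out entirely.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("Googlebot", "https://example.com/listings/1"),
        ("Googlebot", "https://example.com/private/notes"),
        ("MyScraper", "https://example.com/listings/1"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Against a live site you would point the parser at the real file with set_url() and read() instead of a hard-coded string.  Note that robots.txt is a request, not an enforcement mechanism; it only works if the crawler chooses to check it.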

Web scraping is an ethical grey area.  Some opinions draw a line between personal and commercial use of the scraped data; others disagree with even that.  If a company is employing significant counter-scraping technologies, would you expect they’re ok with someone harvesting their data even for personal use?

Whatever your opinion, I view web scraping as something of a puzzle – an intellectual challenge.  Excavating data requires a mix of programming, pattern recognition, and psychology.  To scrape a website, you need to understand how the data and flow of information are organized; you need to mentally reverse engineer the structure.  The more complex methods require experimenting – poke the box and see what happens, so to speak.  Web scraping is a nerdy treasure hunt: even if you don’t care about the data itself, the experience alone is often interesting.
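
To make “reverse engineer the structure” a little more concrete, here is a small sketch in Python using only the standard library’s html.parser.  The HTML fragment, its class names, and the ListingParser and extract_listings names are all hypothetical; a real site will be organized differently, and working out how is the puzzle.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = """
<ul id="results">
  <li class="listing"><a href="/post/101">Used bicycle</a><span class="price">$80</span></li>
  <li class="listing"><a href="/post/102">Desk lamp</a><span class="price">$15</span></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collect url, title, and price from each <li class="listing">."""

    def __init__(self):
        super().__init__()
        self.listings = []
        self._in_listing = False
        self._field = None  # field that the next text node should fill

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "listing":
            self._in_listing = True
            self.listings.append({})
        elif self._in_listing and tag == "a":
            self.listings[-1]["url"] = attrs.get("href")
            self._field = "title"
        elif self._in_listing and tag == "span" and attrs.get("class") == "price":
            self._field = "price"

    def handle_data(self, data):
        text = data.strip()
        if self._in_listing and self._field and text:
            self.listings[-1][self._field] = text
            self._field = None

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_listing = False
        if tag in ("a", "span"):
            self._field = None


def extract_listings(html: str) -> list:
    """Parse html and return one dict per listing found."""
    parser = ListingParser()
    parser.feed(html)
    return parser.listings


if __name__ == "__main__":
    for row in extract_listings(SAMPLE_HTML):
        print(row)
```

The pattern recognition is in spotting that every record sits in a list item with the same class, with the link and price in predictable child tags.  Once you have seen that, the extraction code is mostly bookkeeping.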

So, I’m considering putting up some tutorials about how to approach web scraping, starting from the basics and working up to the very difficult (perhaps such as Google’s CAPTCHAs and Distil Networks’ bot detection).  Ethics aside, scraping, just getting access to the data, can make for a pretty interesting case study in engineering.
