Introduction to Screen Scraping

Eric Rochester, Scholars’ Lab

bit.ly/erochest-screen-scraping

About the Scholars’ Lab

For Today

  • Overview
  • Process
  • A few tools
  • A quick example

What is It?

Caveats

Not Always Possible

Levels of Accessibility

Downloadable

  • Comma-separated values (CSV)
  • JSON
  • Excel
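
When a site offers its data as a download, there is nothing to scrape: Python's standard library can read the file directly. The following is a minimal sketch, assuming made-up sample text in place of a real download (the country names and numbers are placeholders, not real figures, and `read_csv` and `read_json` are hypothetical helper names).

```python
import csv
import io
import json

# Hypothetical sample data standing in for a downloaded file.
CSV_TEXT = "country,users\nAtlantis,10\nLilliput,20\n"
JSON_TEXT = '[{"country": "Atlantis", "users": 10}, {"country": "Lilliput", "users": 20}]'


def read_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json(text):
    """Parse JSON text into Python lists and dicts."""
    return json.loads(text)


csv_rows = read_csv(CSV_TEXT)
json_rows = read_json(JSON_TEXT)
```

The csv module returns every value as a string, while json keeps numbers as numbers. Excel files need a third-party library such as openpyxl.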

HTML

Inaccessible

Looking at you, Flash

And you, PDF

Tools

Python

http://www.python.org/

Requests

http://docs.python-requests.org/en/latest/

lxml or BeautifulSoup

Or Almost Any Other Programming Language

Your Browser!

Chrome and Firefox both come with lots of tools.

The Example Today

Internet Users

http://ancient-shore-4835.herokuapp.com/

How to Approach the Problem

Think about How You Get to the Data

Duplicate That!

Look for One Entry Point

Branch off
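
The approach above can be sketched as a small generator: fetch one entry page, find the pages it leads to, and pull the rows out of each. This is only an illustration of the shape, not the code used in the talk. `scrape`, `fetch`, `find_branches`, `extract_rows`, and `FAKE_SITE` are all hypothetical names, and the "site" is an in-memory dictionary so the sketch runs without a network.

```python
def scrape(entry_url, fetch, find_branches, extract_rows):
    """Start at one entry point, branch off to each linked page, and yield its rows.

    fetch, find_branches, and extract_rows are hypothetical callables that
    the caller supplies for a particular site.
    """
    entry_page = fetch(entry_url)
    for branch_url in find_branches(entry_page):
        for row in extract_rows(fetch(branch_url)):
            yield row


# A made-up, in-memory "site" so the sketch runs without a network.
FAKE_SITE = {
    "/": {"links": ["/a", "/b"], "rows": []},
    "/a": {"links": [], "rows": [("a", 1), ("a", 2)]},
    "/b": {"links": [], "rows": [("b", 3)]},
}

rows = list(scrape(
    "/",
    fetch=FAKE_SITE.__getitem__,
    find_branches=lambda page: page["links"],
    extract_rows=lambda page: page["rows"],
))
```

For a real site, `fetch` would download and parse a page, and the other two callables would encode what you learned by clicking through the site by hand.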

Let’s Explore

Main Page

Form to Subdivide the Data

Data in Table

Paginated

I Want to Go There

Not a Tutorial on Python

Download All the Things!

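import lxml.html
import requests

def get(url, params=None):
    """A basic utility to get and parse a web page."""
    # requests.get returns a response; params become the URL's query string.
    resp = requests.get(url, params=params)
    doc = lxml.html.fromstring(resp.text)
    return doc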

World Traveller

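def get_countries(base_url, doc):
    """Yields a (country_name, country_value) pair for each country option.

    base_url is not used here. COUNTRY is the name of the form's country
    field, a constant defined in the full source.
    """
    # .cssselect() needs the separate cssselect package installed.
    for select in doc.cssselect('form select'):
        if select.get('name') == COUNTRY:
            for option in select.cssselect('option'):
                yield (option.text, option.get('value'))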

On the Table

def get_country_data(url, country_code, page=0):
    """Page through the data for one country, yielding one tuple per table row."""
    # PAGE and COUNTRY are the form's field names, constants defined in the
    # full source. Looping (rather than recursing once per page) avoids
    # hitting Python's recursion limit on long result sets.
    while True:
        doc = get(url, {PAGE: page, COUNTRY: country_code})
        table_rows = doc.cssselect('table tbody tr')

        # An empty page means we have run past the last page of data.
        if not table_rows:
            break

        for table_row in table_rows:
            yield tuple(td.text for td in table_row.cssselect('td'))
        page += 1
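
The generators above only produce rows; they still have to go somewhere. The following is a minimal sketch of one option, writing them out as CSV with the standard library. `write_rows` is a hypothetical helper that is not part of the talk's code, and the sample rows and column names are placeholders.

```python
import csv
import io


def write_rows(rows, out, header=None):
    """Write an iterable of row tuples to an open text file as CSV; return the count."""
    writer = csv.writer(out)
    if header is not None:
        writer.writerow(header)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Placeholder rows; a None cell (such as an empty <td>) is written as an empty field.
sample = [("Atlantis", "2010", "12.5"), ("Atlantis", "2011", None)]
buffer = io.StringIO()
written = write_rows(sample, buffer, header=("country", "year", "users"))
```

With a real file, open it with `newline=""` as the csv documentation recommends, and pass the generator straight in so rows are written as they are scraped.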

But That’s not All

See the full source.

Where to Go from Here

Questions?