Introduction to Screen Scraping

Eric Rochester, Scholars’ Lab

About the Scholars’ Lab

For Today

  • Overview
  • Process
  • A few tools
  • A quick example

What is It?


Not Always Possible

Levels of Accessibility


  • Comma-separated values (CSV)
  • JSON
  • Excel
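
When a site already publishes its data in one of these formats, no scraping is needed; a few lines of standard-library Python can read it directly. A minimal sketch, using made-up sample data (the country names and numbers are placeholders, not real figures):

```python
import csv
import io
import json

# Hypothetical sample data standing in for a downloaded file.
csv_text = "country,users\nAtlantis,100\nElbonia,250\n"
json_text = '[{"country": "Atlantis", "users": 100}, {"country": "Elbonia", "users": 250}]'

# csv.DictReader maps each row to a dict keyed by the header line.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# json.loads turns the JSON text into Python lists and dicts.
json_rows = json.loads(json_text)

print(csv_rows[0]["country"], csv_rows[0]["users"])
print(json_rows[1]["country"], json_rows[1]["users"])
```

CSV values arrive as strings, while JSON keeps numbers as numbers, so CSV columns usually need an explicit conversion step.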



Looking at you, Flash

And you, PDF




lxml or BeautifulSoup

Or Almost Any Other Programming Language

Your Browser!

Chrome and Firefox both ship with built-in developer tools for inspecting a page's HTML and the requests it makes.

The Example Today

Internet Users

How to Approach the Problem

Think about How You Get to the Data

Duplicate That!

Look for One Entry Point

Branch off

Let’s Explore

Main Page

Form to Subdivide the Data

Data in Table


I Want to Go There

Not a Tutorial on Python

Download All the Things!

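import lxml.html  # doc.cssselect() also needs the cssselect package installed
import requests


def get(url, params=None):
    """A basic utility to get and parse a web page."""
    # params, if given, is sent as the URL's query string.
    req = requests.get(url, params=params)
    doc = lxml.html.fromstring(req.text)
    return doc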
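
To see what the params argument does, it may help to build the query string by hand. This is a sketch only: the URL and field names below are hypothetical, not the ones used in the example site.

```python
from urllib.parse import urlencode

# Hypothetical URL and form-field names, for illustration only.
BASE_URL = "http://example.com/stats"
params = {"page": 0, "country": "US"}

# requests.get(url, params=params) appends the encoded parameters to the URL
# as a query string; urlencode shows that encoding step on its own.
query = urlencode(params)
full_url = BASE_URL + "?" + query
print(full_url)
```

This is the same kind of URL the browser requests when the form is submitted, which is why watching the address bar or the network tab is a good way to discover the field names.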

World Traveller

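def get_countries(base_url, doc):
    """Yield a (country_name, country_value) pair for each option in the country menu."""
    # COUNTRY holds the name of the form's country field
    # (a module-level constant not shown on this slide).
    for select in doc.cssselect('form select'):
        if select.get('name') == COUNTRY:
            for option in select.cssselect('option'):
                yield (option.text, option.get('value'))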
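
The same idea can be tried offline. The sketch below swaps in the standard library's html.parser for lxml so it runs with no extra packages; the form markup and the field name "country" are hypothetical stand-ins for the real page.

```python
from html.parser import HTMLParser

# Hypothetical form markup and field name, for illustration only.
SAMPLE = """
<form>
  <select name="country">
    <option value="1">Atlantis</option>
    <option value="2">Elbonia</option>
  </select>
  <select name="year"><option value="2000">2000</option></select>
</form>
"""


class OptionCollector(HTMLParser):
    """Collect (text, value) pairs for the options of one named select."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.in_select = False    # inside the select we care about?
        self.in_option = False    # inside one of its options?
        self.current_value = None
        self.current_text = []
        self.options = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self.in_select = attrs.get("name") == self.select_name
        elif tag == "option" and self.in_select:
            self.current_value = attrs.get("value")
            self.current_text = []
            self.in_option = True

    def handle_data(self, data):
        if self.in_option:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "option" and self.in_option:
            text = "".join(self.current_text).strip()
            self.options.append((text, self.current_value))
            self.in_option = False
        elif tag == "select":
            self.in_select = False


parser = OptionCollector("country")
parser.feed(SAMPLE)
countries = parser.options
print(countries)
```

Only the options of the matching select are collected; the "year" menu is skipped, just as the name check skips other menus in the lxml version.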

On the Table

def get_country_data(url, country_code, page=0):
    """Page through the data for one country, yielding one tuple per table row."""
    # Loop rather than recurse, so a long run of pages cannot hit the recursion limit.
    while True:
        doc = get(url, {PAGE: page, COUNTRY: country_code})

        # Get the data for the current page, counting it as we go.
        n = 0
        for table_row in doc.cssselect('table tbody tr'):
            n += 1
            yield tuple(td.text for td in table_row.cssselect('td'))

        # An empty page means we have run past the last page of data.
        if n == 0:
            break
        page += 1
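
The stopping rule (keep requesting pages until one comes back empty) can be tried without touching the network. In this sketch, fetch_rows and FAKE_PAGES are hypothetical stand-ins for downloading and parsing a real page, and the rows are made-up values.

```python
# Hypothetical stand-in for the site: page number -> rows on that page.
FAKE_PAGES = {
    0: [("2000", "10"), ("2001", "12")],
    1: [("2002", "15")],
}


def fetch_rows(page):
    """Pretend to download one page; pages past the end are empty."""
    return FAKE_PAGES.get(page, [])


def all_rows(page=0):
    """Yield rows page by page, stopping at the first empty page."""
    while True:
        rows = fetch_rows(page)
        if not rows:
            return
        for row in rows:
            yield row
        page += 1


rows = list(all_rows())
print(rows)
```

Because the function is a generator, the caller can start using rows from the first page before later pages have been requested.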

But That’s Not All

See the full source.

Where to Go from Here