Introduction to Screen Scraping

Eric Rochester, Scholars’ Lab

bit.ly/erochest-screen-scraping

About the Scholars’ Lab

For Today

Overview
Process
A few tools
A quick example

What is It?

Caveats

Legal Issues

Read the fine print.

Wikipedia's Web Scraping article has a nice overview.

Not Always Possible

Levels of Accessibility

Downloadable

Comma-separated values (CSV)
JSON
Excel
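
When the data is already downloadable, no scraping is needed: the file can be parsed directly. The following is a minimal sketch using only the Python standard library. The sample data and the function names are invented for illustration, and Excel files are left out because they need a third-party library.

```python
import csv
import io
import json

# Hypothetical sample data, invented for illustration only.
CSV_TEXT = "country,users\nAtlantis,100\nEl Dorado,250\n"
JSON_TEXT = '[{"country": "Atlantis", "users": 100}]'


def read_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_rows(text):
    """Parse JSON text into Python objects."""
    return json.loads(text)


csv_rows = read_csv_rows(CSV_TEXT)
json_rows = read_json_rows(JSON_TEXT)
print(csv_rows[0]["country"], csv_rows[0]["users"])
print(json_rows[0]["users"])
```

Note that CSV values always arrive as strings, while JSON keeps numbers as numbers.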

HTML

Inaccessible

Looking at you, Flash

And you, PDF

Tools

Python

http://www.python.org/

Requests

http://docs.python-requests.org/en/latest/

lxml or BeautifulSoup

Or Almost Any Other Programming Language

Your Browser!

Chrome and Firefox both ship with developer tools for inspecting a page's HTML and its network requests.

The Example Today

Internet Users

http://ancient-shore-4835.herokuapp.com/

How to Approach the Problem

Think about How You Get to the Data

Duplicate That!

Look for One Entry Point

Branch off
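
The "one entry point, then branch off" idea can be sketched as a small traversal. This is only an illustration under stated assumptions: the site map, the paths, and the `fetch_links` helper are all made up, and a real scraper would download each page and extract its links instead.

```python
# A made-up site map: one entry point that links to several branches.
FAKE_SITE = {
    "/": ["/country/AT", "/country/ED"],
    "/country/AT": [],
    "/country/ED": [],
}


def fetch_links(path):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_SITE.get(path, [])


def walk(entry_point):
    """Visit the entry point, then every page reachable from it, once each."""
    seen = set()
    queue = [entry_point]
    order = []
    while queue:
        path = queue.pop(0)
        if path in seen:
            continue
        seen.add(path)
        order.append(path)
        queue.extend(fetch_links(path))
    return order


visited = walk("/")
print(visited)
```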

Let’s Explore

Main Page

Form to Subdivide the Data

Data in Table

Paginated

I Want to Go There

Not a Tutorial on Python

Download All the Things!

import lxml.html
import requests


def get(url, params=None):
    """A basic utility to get and parse a web page."""
    resp = requests.get(url, params=params)
    doc = lxml.html.fromstring(resp.text)
    return doc
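
The `params` argument ends up in the URL's query string. The sketch below shows roughly what that encoding looks like, using the standard library rather than Requests itself. The parameter names `page` and `country`, the `build_url` helper, and the example.com URL are assumptions for illustration; the slides do not show the real values of the `PAGE` and `COUNTRY` constants.

```python
from urllib.parse import urlencode

# Hypothetical parameter names; the real PAGE and COUNTRY values
# are not shown in the slides.
PAGE = "page"
COUNTRY = "country"


def build_url(base_url, params=None):
    """Append params to base_url as a query string."""
    if not params:
        return base_url
    return base_url + "?" + urlencode(params)


url = build_url("http://example.com/", {PAGE: 0, COUNTRY: "XX"})
print(url)
```

Looking at the URL the browser requests when the form is submitted is a quick way to find the real parameter names.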

World Traveller

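def get_countries(base_url, doc):
    """Yield a (country_name, country_value) pair for each country option."""
    # COUNTRY holds the name attribute of the country <select>; it is not
    # shown on this slide. cssselect() needs the cssselect package installed.
    for select in doc.cssselect('form select'):
        if select.get('name') == COUNTRY:
            for option in select.cssselect('option'):
                yield (option.text, option.get('value'))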
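
To try the option-extraction logic without a network connection or lxml, here is a sketch that swaps in the standard library's `html.parser`. It is not the lxml approach used in the slides. The sample markup, the `OptionCollector` class, and the `get_options` helper are invented for illustration, and the real form on the example site may be structured differently.

```python
from html.parser import HTMLParser

# Invented sample markup; the real form on the example site may differ.
SAMPLE_HTML = """
<form>
  <select name="country">
    <option value="AT">Atlantis</option>
    <option value="ED">El Dorado</option>
  </select>
  <select name="year">
    <option value="2000">2000</option>
  </select>
</form>
"""


class OptionCollector(HTMLParser):
    """Collect (text, value) pairs for the options of one named <select>."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.in_select = False      # inside the <select> we care about
        self.current_value = None   # value of the <option> being read
        self.current_text = []
        self.options = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self.in_select = attrs.get("name") == self.select_name
        elif tag == "option" and self.in_select:
            self.current_value = attrs.get("value")
            self.current_text = []

    def handle_data(self, data):
        if self.current_value is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "option" and self.current_value is not None:
            text = "".join(self.current_text).strip()
            self.options.append((text, self.current_value))
            self.current_value = None
        elif tag == "select":
            self.in_select = False


def get_options(html, select_name):
    """Return the (text, value) pairs for the named <select>."""
    parser = OptionCollector(select_name)
    parser.feed(html)
    return parser.options


countries = get_options(SAMPLE_HTML, "country")
print(countries)
```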

On the Table

def get_country_data(url, country_code, page=0):
    """Page through the data for one country, yielding one tuple per row."""
    # PAGE and COUNTRY are the form's query parameter names (not shown here).
    # A loop avoids hitting Python's recursion limit on long result sets.
    while True:
        doc = get(url, {PAGE: page, COUNTRY: country_code})

        # Get the data for the current page.
        rows = doc.cssselect('table tbody tr')
        for table_row in rows:
            yield tuple(td.text for td in table_row.cssselect('td'))

        # An empty page means we have run out of data.
        if not rows:
            break
        page += 1
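
The stop-on-empty-page logic can be exercised without touching the network by passing in the function that fetches a page. Below is a minimal sketch under stated assumptions: the canned pages, `fake_fetch_rows`, `iter_all_rows`, and `write_csv` are all hypothetical names and data, and saving the rows as CSV is just one possible output format.

```python
import csv
import io

# Canned pages standing in for the site: two pages of rows, then nothing.
FAKE_PAGES = {
    0: [("2000", "10"), ("2001", "12")],
    1: [("2002", "15")],
}


def fake_fetch_rows(country_code, page):
    """Stand-in for fetching and parsing one page of table rows."""
    return FAKE_PAGES.get(page, [])


def iter_all_rows(fetch_rows, country_code):
    """Yield rows page by page until a page comes back empty."""
    page = 0
    while True:
        rows = fetch_rows(country_code, page)
        if not rows:
            break
        for row in rows:
            yield row
        page += 1


def write_csv(rows, out):
    """Write the rows to a file-like object as CSV."""
    writer = csv.writer(out)
    writer.writerows(rows)


all_rows = list(iter_all_rows(fake_fetch_rows, "XX"))
buffer = io.StringIO()
write_csv(all_rows, buffer)
print(buffer.getvalue())
```

Swapping `fake_fetch_rows` for a function that downloads and parses a real page leaves the paging loop unchanged.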

But That’s not All

Where to Go from Here

This Presentation (with Links)

More Links

Questions?