Beyond scraping

Getting data from dynamic, heavily JavaScript-driven websites

Show of hands

Who has used urllib/urllib2 from the standard library?

Who has used requests?

Who has used BeautifulSoup?

Who has used selenium?

Who has used zeromq?

Who has used pyvirtualdisplay?

Python and I

computational linguist

managing 3D and 2D computer graphics software development teams

1993: failed to start using Python

1998: started using Python commercially

2007: ordereddict in C

2014: YAML 1.2 parser, including round-tripping comments

What is the problem?

Why?

What are web pages?

<html>
 <head>
  <title>Beyond scraping</title>
 </head>

 <body>
  <a href="http://othersite.com/some/link.html" id="243" class="important">Other site</a>
 </body>
</html>

mapping from URL to some data

doesn't have to be unique: form-data/state/JavaScript

Interlude 1: building non-trivial software

Simple websites

Cookies

Authentication
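For simple websites, cookies and authentication can be handled with a requests Session. A minimal sketch (the cookie name and credentials are invented for illustration):

```python
import requests

# A Session keeps cookies across requests, so a cookie set by a
# login response is sent automatically on every later request.
session = requests.Session()

# Basic-auth credentials (hypothetical) attached to the whole session:
session.auth = ("user", "secret")

# Simulate the session cookie a login would normally set:
session.cookies.set("sessionid", "abc123")

# Every subsequent session.get()/session.post() now carries
# both the cookie and the Authorization header.
print(session.cookies.get("sessionid"))
```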

JavaScript

what you see in the browser is different

unless you switch off JavaScript

Parts of the tree structure that is HTML are updated/changed by JS

Why?

Downsides

Selenium

Helps with debugging (built-in dev tools, Firebug, etc.)

as long as your program runs

Superset of urllib2/requests

Opens browser window, needs some desktop

The problems with JavaScript based pages

Never sure when the data is there
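One way to cope with not knowing when the data is there is Selenium's explicit waits: poll until an element actually exists instead of sleeping a fixed time. A sketch (URL and selector are placeholders, and a local Firefox/driver is assumed):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://example.com/dynamic-page")  # placeholder URL

# Poll for up to 10 seconds until the JS-generated element appears,
# instead of hoping a fixed sleep was long enough.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.important"))
)
print(element.text)
driver.quit()
```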

Interlude 2: selecting a part of webpage

CSS select

div.important ~ a[href^="https://some.site.com/"]
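The same selector can be tried out with BeautifulSoup's CSS select support (the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class="important">heading</div>
<a href="https://some.site.com/page.html">match</a>
<a href="http://othersite.com/some/link.html">no match</a>
"""
soup = BeautifulSoup(html, "html.parser")

# div.important ~ a[...]: any following sibling <a> of the div
# whose href starts with https://some.site.com/
for link in soup.select('div.important ~ a[href^="https://some.site.com/"]'):
    print(link["href"])
```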

A typical selenium session

Open a browser to some URL

Click login button

(Wait until redirected to OpenID provider)

Provide credentials

Wait until back at the requested page

Fill out search criteria

Click a matching reference

Retrieve the data (textual or some linked file (PDF))

Debugging the above can involve a lot of waiting time
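The session above might be sketched like this (all URLs, element names and credentials are invented placeholders, not the site from the talk):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
wait = WebDriverWait(driver, 30)

driver.get("https://some.site.com/")                   # open the page
driver.find_element(By.ID, "login").click()            # click login button
wait.until(EC.url_contains("openid-provider"))         # redirect to OpenID
driver.find_element(By.NAME, "user").send_keys("me")   # provide credentials
driver.find_element(By.NAME, "password").send_keys("secret")
driver.find_element(By.NAME, "submit").click()
wait.until(EC.url_contains("some.site.com"))           # back at requested page

search = driver.find_element(By.NAME, "q")             # fill search criteria
search.send_keys("statement 2014")
search.submit()
driver.find_element(By.PARTIAL_LINK_TEXT, "2014").click()  # matching reference
data = driver.find_element(By.CSS_SELECTOR, "div.result").text  # retrieve data
driver.quit()
```

Every tweak to a late step means re-running all the earlier steps, which is where the waiting time comes from.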

Client-Server

protocol

data to and from server

ZeroMQ

Many to one

Server can run on a different machine

Unicode based exchanges, easy to get data
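A minimal sketch of such an exchange with pyzmq: a REP server and a REQ client over a local socket (the command string and port are invented):

```python
import threading
import zmq

ctx = zmq.Context()

def server():
    # One REP socket; many REQ clients can connect to it,
    # and it can just as well run on a different machine.
    rep = ctx.socket(zmq.REP)
    rep.bind("tcp://127.0.0.1:5555")
    msg = rep.recv_string()           # unicode-based exchange
    rep.send_string("echo: " + msg)
    rep.close()

t = threading.Thread(target=server)
t.start()

req = ctx.socket(zmq.REQ)
req.connect("tcp://127.0.0.1:5555")
req.send_string("goto_url http://example.com/")   # invented command
print(req.recv_string())
req.close()
t.join()
ctx.term()
```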

A typical client-server based session

Open a browser to some URL, if not already there

Click login button, if not already logged in

(Wait until redirected to OpenID) (if not logged in)

Provide credentials (if not logged in)

Wait until back at the requested page (if not logged in)

Fill out search criteria (if not at the result page yet)

Click a matching reference (if not already clicked)

Retrieve the data (textual or some linked file (PDF))

Debugging the above goes very fast

What protocol functions are needed?

Open a window, by unique id (wid)

Go to URL (wid)

Select some item (iid) on the page (wid)

Click some item (iid)

Clear input/textarea (iid)

Type some text in item (iid)

Return HTML under item (iid)

Return current URL (wid)

Whatever makes things more efficient
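On the server side such a protocol can be a small command dispatcher. A pure-Python sketch with stubbed-out browser actions (the command names and handlers are invented; real handlers would drive the browser via Selenium):

```python
# Record calls instead of driving a browser, so the sketch is testable.
calls = []

def goto_url(wid, url):
    calls.append(("goto", wid, url))
    return "ok"

def get_current_url(wid):
    # A real handler would ask the browser for window wid's URL.
    return "http://example.com/"

HANDLERS = {
    "goto": goto_url,
    "current_url": get_current_url,
}

def handle(message):
    # Messages are simple unicode strings: "command arg1 arg2 ..."
    command, *args = message.split()
    return HANDLERS[command](*args)

print(handle("goto w1 http://example.com/"))   # -> ok
print(handle("current_url w1"))                # -> http://example.com/
```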

BeautifulSoup4

CSS select support

Works on complete HTML pages, so insert page fragments into a stub

<html><body>{}</body></html>
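A sketch of wrapping a returned fragment in that stub before parsing (the fragment is the example link from earlier):

```python
from bs4 import BeautifulSoup

STUB = "<html><body>{}</body></html>"

# The server returns only the HTML under some item (iid), not a full
# page; wrap it so BeautifulSoup sees a complete document.
fragment = ('<a href="http://othersite.com/some/link.html" id="243" '
            'class="important">Other site</a>')
soup = BeautifulSoup(STUB.format(fragment), "html.parser")

for link in soup.select("a.important"):
    print(link["href"])
```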

pyvirtualdisplay (vnc)
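pyvirtualdisplay lets the browser window open on a virtual X display, so the server needs no real desktop. A sketch assuming Xvfb (or Xvnc, which also lets you watch the session with a VNC viewer while debugging) is installed:

```python
from pyvirtualdisplay import Display
from selenium import webdriver

# Start a virtual display; the browser window opens inside it
# instead of on a real desktop.
display = Display(visible=False, size=(1280, 1024))
display.start()

driver = webdriver.Firefox()
driver.get("http://example.com/")   # placeholder URL
print(driver.title)

driver.quit()
display.stop()
```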

Extend

Availability

Real world usage

Picks up bank account statements that otherwise would be sent by mail

Picks up statements from the card used for pumping gas, and renames them (they are all named Document.pdf)

Stackoverflow: YAML question notification

Stackexchange U&L: "competing" at the review queues

download funny pictures

Questions?