Beyond scraping

Getting data from dynamic, heavily JavaScript-driven websites

Show of hands

Who has used urllib/urllib2 from the standard library?

Who has used requests?

Who has used BeautifulSoup?

Who has used selenium?

Who has used zeromq?

Who has used pyvirtualdisplay?

Python and I

computational linguist

managing 3D and 2D computer graphics software development teams

1993: failed to start using Python

1998: started using Python commercially

2007: ordereddict in C

2014: YAML 1.2 parser, including round-tripping comments

What is the problem?

Why?

What are web pages?

<html>
 <head>
  <title>Beyond scraping</title>
 </head>

 <body>
  <a href="http://othersite.com/some/link.html" id="243" class="important">Other site</a>
 </body>
</html>

mapping from URL to some data

doesn't have to be unique: form-data/state/JavaScript

Interlude 1: building non-trivial software

Simple websites

Cookies

Authentication
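For simple websites, cookies and authentication can be handled with a requests Session. A minimal sketch (the cookie name and credentials are invented for illustration):

```python
import requests

# A Session keeps cookies across requests, so a cookie set by a
# login response is sent automatically on every later request.
session = requests.Session()

# Basic-auth credentials (hypothetical) attached to the whole session:
session.auth = ("user", "secret")

# Simulate the session cookie a login would normally set:
session.cookies.set("sessionid", "abc123")

# Every subsequent session.get()/session.post() now carries
# both the cookie and the Authorization header.
print(session.cookies.get("sessionid"))
```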

JavaScript

what you see in the browser is different

unless you switch off JavaScript

Parts of the tree structure that is HTML are updated/changed by JS

Why?

Downsides

Selenium

Helps with debugging (built-in dev tools, Firebug, etc.)

as long as your program runs

Superset of urllib2/requests

Opens browser window, needs some desktop

The problems with JavaScript based pages

Never sure when the data is there
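One way to cope with not knowing when the data is there is Selenium's explicit waits: poll until an element actually exists instead of sleeping a fixed time. A sketch (URL and selector are placeholders, and a local Firefox/driver is assumed):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://example.com/dynamic-page")  # placeholder URL

# Poll for up to 10 seconds until the JS-generated element appears,
# instead of hoping a fixed sleep was long enough.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.important"))
)
print(element.text)
driver.quit()
```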

Interlude 2: selecting a part of webpage

CSS select

div.important ~ a[href^="https://some.site.com/"]
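The same selector can be tried out with BeautifulSoup's CSS select support (the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class="important">heading</div>
<a href="https://some.site.com/page.html">match</a>
<a href="http://othersite.com/some/link.html">no match</a>
"""
soup = BeautifulSoup(html, "html.parser")

# div.important ~ a[...]: any following sibling <a> of the div
# whose href starts with https://some.site.com/
for link in soup.select('div.important ~ a[href^="https://some.site.com/"]'):
    print(link["href"])
```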

A typical selenium session

Open a browser to some URL

Click login button

(Wait until redirected to OpenID provider)

Provide credentials

Wait until back at the requested page

Fill out search criteria

Click a matching reference

Retrieve the data (textual or some linked file (PDF))

Debugging the above can involve a lot of waiting time
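The session above might be sketched like this (all URLs, element names and credentials are invented placeholders, not the site from the talk):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
wait = WebDriverWait(driver, 30)

driver.get("https://some.site.com/")                   # open the page
driver.find_element(By.ID, "login").click()            # click login button
wait.until(EC.url_contains("openid-provider"))         # redirect to OpenID
driver.find_element(By.NAME, "user").send_keys("me")   # provide credentials
driver.find_element(By.NAME, "password").send_keys("secret")
driver.find_element(By.NAME, "submit").click()
wait.until(EC.url_contains("some.site.com"))           # back at requested page

search = driver.find_element(By.NAME, "q")             # fill search criteria
search.send_keys("statement 2014")
search.submit()
driver.find_element(By.PARTIAL_LINK_TEXT, "2014").click()  # matching reference
data = driver.find_element(By.CSS_SELECTOR, "div.result").text  # retrieve data
driver.quit()
```

Every tweak to a late step means re-running all the earlier steps, which is where the waiting time comes from.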

Client-Server

protocol

data to and from server

ZeroMQ

Many to one

Server can run on a different machine

Unicode based exchanges, easy to get data
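A minimal sketch of such an exchange with pyzmq: a REP server and a REQ client over a local socket (the command string and port are invented):

```python
import threading
import zmq

ctx = zmq.Context()

def server():
    # One REP socket; many REQ clients can connect to it,
    # and it can just as well run on a different machine.
    rep = ctx.socket(zmq.REP)
    rep.bind("tcp://127.0.0.1:5555")
    msg = rep.recv_string()           # unicode-based exchange
    rep.send_string("echo: " + msg)
    rep.close()

t = threading.Thread(target=server)
t.start()

req = ctx.socket(zmq.REQ)
req.connect("tcp://127.0.0.1:5555")
req.send_string("goto_url http://example.com/")   # invented command
print(req.recv_string())
req.close()
t.join()
ctx.term()
```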

A typical client-server based session

Open a browser to some URL, if not already there

Click login button, if not already logged in

(Wait until redirected to OpenID) (if not logged in)

Provide credentials (if not logged in)

Wait until back at the requested page (if not logged in)

Fill out search criteria (if not at the result page yet)

Click a matching reference (if not already clicked)

Retrieve the data (textual or some linked file (PDF))

Debugging the above goes very fast

What protocol functions are needed?

Open a window, by unique id (wid)

Go to URL (wid)

Select some item (iid) on the page (wid)

Click some item (iid)

Clear input/textarea (iid)

Type some text in item (iid)

Return HTML under item (iid)

Return current URL (wid)

Whatever makes things more efficient
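On the server side such a protocol can be a small command dispatcher. A pure-Python sketch with stubbed-out browser actions (the command names and handlers are invented; real handlers would drive the browser via Selenium):

```python
# Record calls instead of driving a browser, so the sketch is testable.
calls = []

def goto_url(wid, url):
    calls.append(("goto", wid, url))
    return "ok"

def get_current_url(wid):
    # A real handler would ask the browser for window wid's URL.
    return "http://example.com/"

HANDLERS = {
    "goto": goto_url,
    "current_url": get_current_url,
}

def handle(message):
    # Messages are simple unicode strings: "command arg1 arg2 ..."
    command, *args = message.split()
    return HANDLERS[command](*args)

print(handle("goto w1 http://example.com/"))   # -> ok
print(handle("current_url w1"))                # -> http://example.com/
```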

BeautifulSoup4

CSS select support

Works on complete HTML pages, so insert page fragments into a stub

<html><body>{}</body></html>
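A sketch of wrapping a returned fragment in that stub before parsing (the fragment is the example link from earlier):

```python
from bs4 import BeautifulSoup

STUB = "<html><body>{}</body></html>"

# The server returns only the HTML under some item (iid), not a full
# page; wrap it so BeautifulSoup sees a complete document.
fragment = ('<a href="http://othersite.com/some/link.html" id="243" '
            'class="important">Other site</a>')
soup = BeautifulSoup(STUB.format(fragment), "html.parser")

for link in soup.select("a.important"):
    print(link["href"])
```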

pyvirtualdisplay (vnc)
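pyvirtualdisplay lets the browser window open on a virtual X display, so the server needs no real desktop. A sketch assuming Xvfb (or Xvnc, which also lets you watch the session with a VNC viewer while debugging) is installed:

```python
from pyvirtualdisplay import Display
from selenium import webdriver

# Start a virtual display; the browser window opens inside it
# instead of on a real desktop.
display = Display(visible=False, size=(1280, 1024))
display.start()

driver = webdriver.Firefox()
driver.get("http://example.com/")   # placeholder URL
print(driver.title)

driver.quit()
display.stop()
```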

Extend

Availability

Real world usage

Picks up bank account statements that otherwise would be sent by mail

Picks up statements from the card used for pumping gas, and renames them (they are all named Document.pdf)

Stackoverflow: YAML question notification

Stackexchange U&L: "competing" at the review queues

download funny pictures

Questions?