Download information from all kinds of websites
Interact with websites, not just download from them
tree structure of tags
tags can have attributes
tags can have data
<html>
  <head>
    <title>Beyond scraping</title>
  </head>
  <body>
    <a href="http://othersite.com/some/link.html" id="243" class="important">Other site</a>
  </body>
</html>
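The tree structure above can be made visible with the standard library's html.parser; a minimal sketch (in practice BeautifulSoup or lxml would do this job), printing each tag with its attributes and data, indented by depth:

```python
from html.parser import HTMLParser

class ShowStructure(HTMLParser):
    """Print the tag tree: tag names, their attributes, and their data."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        suffix = " %s" % dict(attrs) if attrs else ""
        self.lines.append("  " * self.depth + tag + suffix)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text nodes
            self.lines.append("  " * self.depth + repr(data.strip()))

doc = ('<html><head><title>Beyond scraping</title></head><body>'
       '<a href="http://othersite.com/some/link.html" id="243" '
       'class="important">Other site</a></body></html>')
parser = ShowStructure()
parser.feed(doc)
print("\n".join(parser.lines))
```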
use a framework
use building blocks
Use urllib2/requests
submitting form data works fine
redirection
keeping state
often used to preserve authentication information
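Keeping state across requests can be sketched with the standard library's cookie handling (requests does the same with a `Session`); the local test server here is only a stand-in so the round trip is self-contained:

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    """Stand-in server: /login sets a cookie, any other path echoes it."""
    def do_GET(self):
        self.send_response(200)
        if self.path == "/login":
            self.send_header("Set-Cookie", "session=abc123")
        self.end_headers()
        self.wfile.write(self.headers.get("Cookie", "").encode())

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

# the cookie jar preserves the session cookie between requests
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.open(base + "/login").read()
reply = opener.open(base + "/data").read().decode()
server.shutdown()
```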
built into the browser
some form + cookies
OpenID
urllib2/requests are of little use here
Nicer user experience, quicker updates
Just use a browser and interact with it
Selenium is primarily used for testing, but driving it yourself is easy
Never any discrepancy with what you see as a "normal" user
Wait a reasonable amount of time
Check if some particular piece of data is available
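The wait-and-check pattern above (Selenium's "explicit wait") can be sketched as a small generic polling helper; `wait_for` and the demo `ready` condition are names chosen here for illustration:

```python
import time

def wait_for(condition, timeout=10.0, interval=0.25):
    """Poll `condition` until it returns a truthy value or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within %.1fs" % timeout)

# demo condition: the "data" only becomes available on the third check
state = {"tries": 0}
def ready():
    state["tries"] += 1
    return state["tries"] >= 3 and "found"

result = wait_for(ready, timeout=5.0, interval=0.01)
```

In real use the condition would be a callable that queries the browser for the element, e.g. by id or CSS selector.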
by "id"
by class
programmatically walking the tree
XPATH
CSS selectors are a re-usable option
div.important ~ a[href^="https://some.site.com/"]
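The first three selection styles can be sketched with the standard library's ElementTree, which supports a small XPath subset (full XPath and CSS selectors need lxml or BeautifulSoup); the document is the example page from earlier:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><head><title>Beyond scraping</title></head><body>'
    '<a href="http://othersite.com/some/link.html" id="243" '
    'class="important">Other site</a></body></html>')

by_id = doc.find(".//a[@id='243']")            # by "id"
by_class = doc.find(".//a[@class='important']")  # by class
walked = doc.find("body").find("a")            # programmatically walking the tree
```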
skill useful to have when building websites
beware of restrictions
the server keeps the browser open even if the client doesn't "understand" the page structure
send a command with parameters, get result back
faster than using Selenium
especially good for large, table-based reference data
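The command-with-parameters idea can be sketched as a toy in-process dispatcher (hypothetical command and field names; not the actual ruamel.browser protocol): the client sends a JSON command, the server runs it against the long-lived browser and returns the result.

```python
import json

class ToyBrowserServer:
    """Toy stand-in for a server holding a long-lived browser."""
    def __init__(self):
        self.current_url = None

    def handle(self, raw):
        msg = json.loads(raw)
        cmd = msg.get("cmd")
        if cmd == "get":  # navigate the browser to a URL
            self.current_url = msg["params"]["url"]
            return json.dumps({"ok": True, "url": self.current_url})
        return json.dumps({"ok": False, "error": "unknown command: %r" % cmd})

server = ToyBrowserServer()
reply = json.loads(server.handle(json.dumps(
    {"cmd": "get", "params": {"url": "http://othersite.com/"}})))
```

In the real setup the commands would travel over a socket, so the browser survives between client invocations.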
many ways to select data from the HTML site
<html><body>{}</body></html>
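A sketch of how the template above could be used, assuming the server returns just the selected fragment: wrapping it in a minimal page lets ordinary HTML tooling parse the result.

```python
# assumption: the server hands back only the selected HTML fragment
TEMPLATE = "<html><body>{}</body></html>"

def wrap(fragment):
    """Embed a bare fragment in a minimal, parseable HTML page."""
    return TEMPLATE.format(fragment)
```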
virtual framebuffer (e.g. Xvfb) replacing the need for a desktop
still easy to inspect by connecting to the virtual window over VNC
restrict advertisements by configuration
use the Tor network
Not yet on PyPI
Some proprietary stuff needs removing
ruamel.browser.client / ruamel.browser.server