EuroPython 2016

PySpark and Warcraft Data

Speaker(s) vincent warmerdam

In this talk I will describe how to use Apache Spark (PySpark) with some data from the World of Warcraft API from an iPython notebook. Spark is interesting because it speeds up iterative processes on your hadoop cluster as well as your local machine.

I will give basic benchmarks (comparing it to numpy/pandas/scikit), explain the architecture/performance behind the technology and will give a live demo on how I used Spark to analyse an interesting dataset. I’ll explain why you might want to use Spark and I’ll also go in and explain when you don’t want to use it.

The dataset I will be using is a 22Gb json blob containing auction house data from all world of warcraft servers over a period of time. The goal of the analysis will be to determine when and if basic economics still applies in a massively online game.

I will assume that the everyone knows what the ipython notebook is and I will assume a basic knowledge of numpy/pandas but nothing fancy. The dataset has been chosen such that people who are less interested in Spark can still enjoy the analysis part of the talk. If you know very little about data science but if you love video games then you should like this talk.

Video


Do you have some questions on this talk?

New comment