Workshop for PyDay BCN 2019

These are the slides and notebook for the workshop "The magic of PySpark", prepared for PyDay Barcelona 2019.

You can find the slides here (some images might look slightly blurry). I recommend you check the version with presenter notes.


This presentation is written in Markdown and prepared to be used with Deckset. The drawings were done on an iPad Pro using Procreate. Only the final PDF and the source Markdown are available here. Sadly, the animated GIFs appear as static images in the PDF.


You can run the notebook in Binder

Note: The Arrow optimisation does not work in Binder. I'll try to fix it, but it won't be ready for the workshop. Check the notebook with outputs to see the impact of the Arrow optimisation.

Or just read it on GitHub: here (no outputs) or here (with outputs).


Thanks


Details on running the notebook

To take full advantage of the workshop locally (without using Binder), you'll need:

  • PySpark installed (anything more recent than 2.3 should be fine)
  • Jupyter installed
  • Pandas and Arrow installed
  • All able to talk to each other
  • One or more datasets

The TL;DR if you don't want to use Docker should just be:

pip install pyarrow pandas pyspark numpy jupyter

You can install PySpark with pip install pyspark. Installing it in the same environment as Jupyter should make them talk to each other just fine. You should also run pip install pyarrow, although it's not a big problem if this one fails for some reason. To make the analysis more entertaining, also run pip install pandas, again in the same environment. You can also install these with conda, using conda install -c conda-forge pyspark, although pip might be more convenient (PySpark can easily get confused when there are many Python environments).
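To confirm that everything landed in the same environment, a minimal sketch like the following can be run from the Python interpreter that Jupyter uses. The helper name `check_packages` is hypothetical (it is not part of the workshop material), and `jupyter_core` is used here as an assumed stand-in for "Jupyter is installed".

```python
from importlib.util import find_spec


def check_packages(names):
    """Return a dict mapping each top-level package name to True if it can be found."""
    # find_spec returns None when a top-level package is not installed
    # in the current environment; it does not import the package.
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Packages from the pip command above; jupyter_core is an assumed
    # proxy for a working Jupyter installation.
    required = ["pyspark", "pandas", "pyarrow", "numpy", "jupyter_core"]
    for name, found in check_packages(required).items():
        print(f"{name:14s} {'ok' if found else 'MISSING'}")
```

If pyarrow shows up as missing, the rest of the notebook should still be usable, only without the Arrow optimisation.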


If you are familiar enough with Docker, I recommend using a Docker container instead.

Run this before the workshop:

docker pull rberenguel/pyspark_workshop

During the workshop (or before), start the container from the folder where you want to create your Jupyter notebook:

docker run --name pyspark_workshop -d -p 8888:8888 -p 4040:4040 -p 4041:4041 -v "$PWD":/home/jovyan rberenguel/pyspark_workshop

To open Jupyter in your browser, run

docker logs pyspark_workshop 

and open the URL provided in the logs (should look like http://127.0.0.1:8888/?token=36a20c93f0ee8cab4699e2460261e3b16787a68fbb034aee)

This container installs Arrow on top of the usual jupyter/pyspark, to allow for some additional optimisations in Spark.
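As a rough sketch of how Arrow is typically switched on from PySpark: the setting is called `spark.sql.execution.arrow.enabled` in Spark 2.3 and 2.4, and `spark.sql.execution.arrow.pyspark.enabled` from Spark 3.0 onwards. The helper names `arrow_config_key` and `enable_arrow` are hypothetical, and it is an assumption that the workshop notebook enables Arrow through this setting.

```python
def arrow_config_key(spark_version):
    """Pick the Arrow setting name for a Spark version string such as '2.4.4'."""
    major = int(spark_version.split(".")[0])
    # Spark 3.0 renamed the setting; the older name applies to 2.3 and 2.4.
    if major >= 3:
        return "spark.sql.execution.arrow.pyspark.enabled"
    return "spark.sql.execution.arrow.enabled"


def enable_arrow(spark):
    """Turn on Arrow-based conversion for an existing SparkSession.

    `spark` is expected to be a pyspark.sql.SparkSession; nothing is
    imported here, so this module loads even without PySpark installed.
    """
    key = arrow_config_key(spark.version)
    spark.conf.set(key, "true")
    return key
```

With a running session named `spark`, calling `enable_arrow(spark)` before conversions such as `df.toPandas()` should let Spark use Arrow for the data transfer, provided pyarrow is installed.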

