Holden Karau presented a super-fast introduction to PySpark: how to use Python and Spark together when you exceed the limitations of a single machine. Apache Spark is a fast and general engine for distributed computing and big data processing with APIs in Scala, Java, Python, and R. This tutorial will briefly introduce PySpark (the Python API for Spark) with some hands-on exercises combined with a quick introduction to Spark’s core concepts. We will cover the obligatory wordcount example that comes with every big-data tutorial, and discuss Spark’s unique methods for handling node failure and other relevant internals.
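For readers who have not seen the wordcount example before, here is a minimal sketch of what it typically looks like in PySpark. It is not taken from the tutorial itself: the helper names (`tokenize`, `count_words_local`, `count_words_spark`), the sample lines, and the tokenization rule are all assumptions made for illustration. The single-machine version is included only as a baseline to compare against.

```python
import re
from collections import Counter
from operator import add

# Hypothetical sample input; a real job would usually read from files.
SAMPLE_LINES = ["To be, or not to be", "that is the question"]


def tokenize(line):
    """Lowercase a line and return its words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", line.lower())


def count_words_local(lines):
    """Single-machine baseline: count words with a Counter."""
    return dict(Counter(word for line in lines for word in tokenize(line)))


def count_words_spark(sc, lines):
    """The same count written as RDD transformations.

    `sc` is assumed to be an existing pyspark.SparkContext.
    """
    return (
        sc.parallelize(lines)          # distribute the lines across partitions
        .flatMap(tokenize)             # one record per word
        .map(lambda word: (word, 1))   # pair each word with a count of 1
        .reduceByKey(add)              # sum the counts for each word
        .collectAsMap()                # bring the result back as a dict
    )
```

With PySpark installed, one way to try it is to create a local context with `from pyspark import SparkContext` and `sc = SparkContext("local[2]", "wordcount-sketch")`, call `count_words_spark(sc, SAMPLE_LINES)`, and then call `sc.stop()`. The result should match `count_words_local(SAMPLE_LINES)`. Each transformation in the chain is recorded in the RDD's lineage, which is what Spark uses to recompute lost partitions after a node failure.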
About the speakers:
Holden Karau is a transgender Canadian and an active open source contributor. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden talks internationally on Spark and holds office hours at coffee shops at home and abroad. Holden is a co-author of numerous books on Spark, including High Performance Spark (which she believes is the gift of the season for those with expense accounts) and Learning Spark. She is a Spark committer and makes frequent contributions, specializing in PySpark and Machine Learning. Prior to IBM she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software she enjoys playing with fire, welding, scooters, poutine, and dancing.
Michael Li founded The Data Incubator, a New York-based training program that turns talented PhDs from academia into workplace-ready data scientists and quants. The program is free to Fellows; employers engage with the Incubator as hiring partners. Previously, he worked as a data scientist (Foursquare), Wall Street quant (D.E. Shaw, J.P. Morgan), and a rocket scientist (NASA). He completed his PhD at Princeton as a Hertz Fellow and read Part III Maths at Cambridge as a Marshall Scholar. At Foursquare, Michael discovered that his favorite part of the job was teaching and mentoring smart people about data science. He decided to build a startup to focus on what he really loves. Michael lives in New York, where he enjoys the opera, rock climbing, and attending geeky data science events.