The Data Incubator

"Businesses are drowning in data but starving for insights."
- Forrester

12-Day Program

Everyone starts out at a slightly different level. We really suggest you set aside some time (approximately 5 hours a day over 12 days) between now and the start of the bootcamp to work through this preparatory material. Some people will be able to work through it in less time; for others with less previous experience, it may take longer. Past Fellows and Scholars have said they got a lot more out of the bootcamp because they took the time to practice these skills before starting.

Take a look at each day's Action Item(s) and review the accompanying material. If a topic is new to you, take a deeper dive and try to understand it. If you already know some of it, focus on the parts you don't. But definitely make sure you complete each day's Action Item(s).

In general, if two options are listed, the first is slightly preferable, but if it doesn't work for you, just use the second.

Warning: When working through a coding tutorial, it is really tempting to copy and paste code from the tutorial directly into the Python shell and think you understand it. Don't do this! Force yourself to type the code character by character: this forces you to learn the commands so that when you actually need them, you'll have them closer to your fingertips.

There is a Milestone Project associated with working through this material, which draws on the material you should be learning and involves building a working web app on Day 10. Keep an eye out for Milestone action items. There are some deployment resources available in the "Day 10" tab below.

Day 1: Baseline Computer Science 1

It's crucially important that you become comfortable with the computer science fundamentals that data scientists are expected to know. Don't think of the following material as something you must complete in the next two days; treat it as a starting point. Over the course of the program, if you put in the time, you'll be able to converse fluently even without a CS degree.

Algorithms and data structures are important foundational knowledge for programming. There are a lot of 'dumb' ways to do things; these are the 'smart' ways that will save you a lot of time on the job.

There's a great interactive Python-based tutorial available online. From Basic Data Structures, understand the following (a short stack and queue sketch appears after the list):

  • Lists
  • Linked Lists
  • Stacks
  • Queues
  • Deques
  • Heaps
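
For instance, stacks and queues are easy to experiment with using Python's built-in types. A minimal sketch (ours, not the tutorial's):

    from collections import deque

    stack = []              # a plain Python list works well as a stack
    stack.append(1)         # push
    stack.append(2)
    print(stack.pop())      # 2 -- last in, first out

    queue = deque()         # a deque gives O(1) appends and pops at both ends
    queue.append(1)         # enqueue
    queue.append(2)
    print(queue.popleft())  # 1 -- first in, first out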

From Sorting and Searching, understand the following (a merge sort sketch appears after the list):

  • Hash Tables
  • The Bubble Sort
  • The Selection Sort
  • The Insertion Sort
  • The Shell Sort
  • The Merge Sort
  • The Quick Sort
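
As a self-check, make sure you could write something like this merge sort from scratch. A minimal sketch (the tutorial's version differs in its details):

    def merge_sort(items):
        """Sort a list in O(n log n) time by splitting, sorting, and merging."""
        if len(items) <= 1:
            return items
        mid = len(items) // 2
        left = merge_sort(items[:mid])
        right = merge_sort(items[mid:])
        merged = []
        i = j = 0
        while i < len(left) and j < len(right):  # merge the two sorted halves
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])   # one of these is empty; the other is sorted
        merged.extend(right[j:])
        return merged

    print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]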

Action Item: Take a minute to fill out the Pre-Fellowship Questionnaire.

Action Item: To test your knowledge, make sure you can answer the first five exercises for Basic Data Structures and Sorting and Searching.

Attention: The problems here and in the next section are exactly the kinds of questions that tend to come up in 'programming' job interviews for data science and quant positions, so this is good practice for finding a job. Consider doing these exercises with a timer, as you would in an interview; it will really help you.

Day 2: Baseline Computer Science 2

As you're thinking about these topics and the ones from yesterday, keep time complexity in mind. It's extremely common to be asked about the time complexity of an algorithm or a potential solution. You can find more on Wikipedia and use this cheatsheet as a reference. The following are some more intermediate topics that often come up.

From Trees and Tree Algorithms, understand:

  • Binary Trees
  • Priority Queues with Binary Heaps
  • Binary Search Trees

From Graphs and Graph Algorithms, understand the following (a breadth first search sketch appears after the list):

  • Breadth First Search
  • Depth First Search
  • Shortest Path
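
To make this concrete, here is a minimal breadth first search over a graph stored as an adjacency dict (our sketch, not the tutorial's code):

    from collections import deque

    def bfs(graph, start):
        """Visit nodes in breadth first order from a starting node."""
        visited = {start}
        order = []
        frontier = deque([start])
        while frontier:
            node = frontier.popleft()
            order.append(node)
            for neighbor in graph[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    frontier.append(neighbor)
        return order

    graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    print(bfs(graph, "A"))  # ['A', 'B', 'C', 'D']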

And, of course, you should already know Recursion and Dynamic Programming (brush up if you don't).
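
As a refresher, memoization is the standard trick that turns the exponential-time recursive Fibonacci into a linear-time one. A minimal sketch using only the standard library:

    from functools import lru_cache

    @lru_cache(maxsize=None)  # cache results so each fib(n) is computed once
    def fib(n):
        """Naive recursion is O(2^n); with memoization this is O(n)."""
        if n < 2:
            return n
        return fib(n - 1) + fib(n - 2)

    print(fib(50))  # 12586269025, instant with memoization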

Action Item: To test your knowledge, make sure you can answer the first five exercises for Recursion, Trees and Tree Algorithms, and Graphs and Graph Algorithms.

Day 3: Python Basics

Python is a great language for data analytics. It offers many (though not all) of the tools available in languages like Matlab and R. Unlike Matlab, it's free. Unlike both of those languages, it promotes good coding practices. More importantly (for when you start working), it's a real engineering language that makes it easy for you to:

  • Collect data from existing databases using the tools that are currently available in your company.
  • Integrate your code and contributions into the rest of the codebase that your company will use.

You are already familiar with programming; you just have to get familiar with Python’s syntax (if you aren't already) and the numerical and scientific tools available. We are using Python 3.

You will use a prepared Python environment on your cloud server for the program, but you will likely find it useful to have a Python environment on your local machine as well. If you don't have one installed, we recommend the Anaconda distribution. It's free, and it comes with many of the packages we will use already installed. Just be sure to get the Python 3 version.

Action Items:

  1. Get Python running on your local machine. If you don't know where to start, use the Anaconda distribution.

  2. To get started with learning the syntax, we suggest starting with either the Learn Python tutorial or the more detailed Google Tutorial. Once you have the basics, this page covers an impressive number of Python features. Note that it's in Python 2, so the syntax differs slightly in a few places. Features particularly important to data science include (a sketch illustrating several of these appears after this list):

    • Dictionaries (and converting back and forth to and from lists)
    • Generators (as opposed to lists)
    • List Comprehensions
    • Lambda Functions
    • Decorators
    • Variable number of arguments
    • Classes
  3. Once you’ve gone through the above short tutorials, go to Project Euler and use Python to solve at least 10 problems. Try to choose problems that allow you to practice using dictionaries, list comprehensions, and other Pythonic features.
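
Here is the sketch promised above, touching several of these features at once (all names are illustrative; nothing here comes from the linked page):

    # Dictionaries, and converting to and from lists
    lengths = {word: len(word) for word in ["data", "science"]}
    as_list = list(lengths.items())   # [('data', 4), ('science', 7)]
    back_to_dict = dict(as_list)

    # A generator expression: values are produced lazily, not stored in a list
    squares = (n * n for n in range(10 ** 6))
    print(next(squares))  # 0

    # A lambda function as a one-off sort key
    by_length = sorted(lengths.items(), key=lambda item: item[1])

    # A simple decorator; *args/**kwargs handle a variable number of arguments
    def logged(func):
        def wrapper(*args, **kwargs):
            print("calling", func.__name__)
            return func(*args, **kwargs)
        return wrapper

    @logged
    def add(x, y):
        return x + y

    print(add(2, 3))  # prints "calling add", then 5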

Day 4: Unix, Bash, Git

Serious software is written on Unix systems, and all the tools we use are built on them (including git, which helps you manage your code). Being able to navigate the command line and leverage Unix tools makes you a much more powerful data scientist.

  1. Start becoming familiar with Unix by practicing. First you need to get access to a Unix terminal.

    • The cloud server we've provided is running a Jupyter server on JupyterHub, with Debian-based Linux underneath. It runs interactive notebooks, but also provides a full-fledged text editor and a web-based bash terminal. To start the terminal emulator, click My Server (or Start My Server, if you haven't started it yet) > New > Terminal in the upper right corner of a directory listing. We strongly encourage you to use the Jupyter notebooks and terminal to complete the 12-day program; this will get you familiar with working on remote servers - an extremely useful skill for any data scientist.

    • Action Item: Go through this great tutorial on Unix.

  2. Understand the package management system on your local system (not the remote server). Package management systems automate the process of installing, upgrading, configuring and removing software packages from your computer.

    • (OPTIONAL) On OS X people use Homebrew. If you wish to try installing dev-oriented packages locally, it's worth investigating.

    • On Ubuntu people use apt-get, which comes pre-packaged with the OS.

  3. Git helps programmers manage code so that you can easily try new things and discard them if you don't like them. Follow these instructions to install git and set up a GitHub account. Read about the very basics of git on this page and work through this interactive tutorial. If you're looking for something more visual, check out these GUIs.

  4. Action Item: Complete this fun interactive tutorial on the basics of git and this other one on branches in git.

Project Milestone: Git clone the Flask Demo repository, make and commit a change, then add your personalized version to GitHub. This can be a public repo on your own personal GitHub account.

Day 5: Scientific Computation

Python has a wonderful suite of libraries for scientific computing (i.e. vectorized numerical work and linear algebra). Rather than writing slow for-loops, you should vectorize your operations using numpy and scipy. If you like videos, try watching this one and following along with the accompanying GitHub code.
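
A minimal sketch of what vectorization buys you (the numbers are illustrative):

    import numpy as np

    x = np.random.rand(1000000)

    # Slow: an explicit Python for-loop
    total = 0.0
    for value in x:
        total += value * value

    # Fast: the same computation vectorized through numpy
    total_vectorized = np.dot(x, x)

    assert np.isclose(total, total_vectorized)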

Action Item:

  1. Two excellent tutorials are available from scipy.org and UCSB. For those of you familiar with Matlab, here’s a cheatsheet for translating your Matlab knowledge into numpy.

  2. Do the Neophyte, Novice, and Apprentice levels from the numpy 100 exercises page.

Day 6: Plotting

matplotlib is Python's standard plotting library. Its default output tends to be pretty ugly, so try the seaborn add-on to make it prettier.
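
A minimal sketch of the two working together (assuming both libraries are installed):

    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns

    sns.set()  # apply seaborn's nicer default styling to matplotlib

    x = np.linspace(0, 10, 200)
    plt.plot(x, np.sin(x), label="sin(x)")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.show()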

Bokeh is a plotting library for Python that focuses on scalable interactivity.

Action Item:

  1. Go through this basic tutorial on matplotlib as well as this more advanced one.
  2. Go through a few of the Bokeh examples and look at the source code in GitHub.

Project Milestone: Create and display a Bokeh plot using iPython/Jupyter notebooks (guide).
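
Something along these lines should get you started (a minimal sketch; see the linked guide for the details):

    from bokeh.io import output_notebook, show
    from bokeh.plotting import figure

    output_notebook()  # render Bokeh plots inline in the notebook

    fig = figure(title="My first Bokeh plot",
                 x_axis_label="x", y_axis_label="y")
    fig.line([1, 2, 3, 4], [4, 2, 5, 3], line_width=2)
    show(fig)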

Day 7: Manipulating Data

pandas is a Python adaptation of R's dataframe object. If you like videos, try watching this one.

Action Item: Wes McKinney, who created pandas, has a great video tutorial. Watch it here and follow along with his code on GitHub.

Project Milestone: Manipulate some randomly generated data in a pandas dataframe. Make sure you know how to select a specific column, how to set the dataframe index (and, in particular, how to make it a datetime index), and how to select ranges of rows by index.
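
A minimal sketch covering each of those operations (the data and dates are made up):

    import numpy as np
    import pandas as pd

    # Randomly generated data, with dates initially stored as plain strings
    raw = pd.DataFrame({
        "date": pd.date_range("2018-01-01", periods=60, freq="D").astype(str),
        "price": np.random.randn(60).cumsum(),
    })

    raw["date"] = pd.to_datetime(raw["date"])   # parse the strings as datetimes
    df = raw.set_index("date")                  # make the dates the index

    prices = df["price"]                        # select a specific column
    january = df.loc["2018-01"]                 # partial-string datetime selection
    window = df.loc["2018-01-10":"2018-01-20"]  # a range of rows by index
    print(window.head())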

Day 8: SQL

You will need to start understanding database basics. Being comfortable with a database to store your data really separates amateur data scientists from professionals.

Normally, you would have to install a SQL database on your computer to be able to interact with it, and installation can be a little painful. Fortunately, there's a nice online interactive tutorial where you can run your queries right in the browser. Alternatively, you can follow this tutorial by Mode Analytics.

Feedback from partners indicates that being knowledgeable about more advanced SQL queries is important for interviewing.

Action Item: Go through the sqlzoo tutorial. By the end of this, you should be able to comfortably SELECT, INSERT, DELETE, and JOIN.

Action Item: Go through the modeanalytics tutorial and make sure you're comfortable writing queries that utilize UNION and CASE WHEN.
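
If you'd also like to practice locally, Python ships with sqlite3, so you can try JOINs and CASE WHEN without installing anything (a minimal sketch with made-up tables):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # a throwaway in-memory database
    conn.executescript("""
        CREATE TABLE customers (name TEXT, city TEXT);
        INSERT INTO customers VALUES ('alice', 'NYC'), ('bob', 'SF');
        CREATE TABLE orders (id INTEGER, customer TEXT, total REAL);
        INSERT INTO orders VALUES (1, 'alice', 20.0), (2, 'bob', 55.0);
    """)

    # JOIN the tables and bucket each order with a CASE WHEN expression
    rows = conn.execute("""
        SELECT c.city,
               CASE WHEN o.total >= 50 THEN 'large' ELSE 'small' END AS size
        FROM orders o JOIN customers c ON o.customer = c.name
    """).fetchall()
    print(rows)  # e.g. [('NYC', 'small'), ('SF', 'large')]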

Day 9: Regular Expressions and jQuery

A common job is to extract data and structure from messy text or HTML documents.

For parsing semi-regular text, you should use regular expressions. There's a great interactive online tutorial that teaches you how to use them. In Python, regular expressions are available through the built-in re module.
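
A minimal sketch of re in action (the log text is made up):

    import re

    log = "2018-01-15 ERROR disk full; 2018-01-16 INFO all clear"

    # Capture every date together with the log level that follows it
    pattern = re.compile(r"(\d{4}-\d{2}-\d{2}) (\w+)")
    print(pattern.findall(log))
    # [('2018-01-15', 'ERROR'), ('2018-01-16', 'INFO')]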

If you want to select items from HTML, you should learn to use jQuery-style (CSS) selectors. W3Schools has a single-page interactive tutorial to teach you the basics (there's really not that much to it!). In Python, these selectors are available through a library called Beautiful Soup, which is more robust to malformed HTML than alternatives like lxml, although it is slower.
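
A minimal sketch (the HTML is made up; assumes the beautifulsoup4 package is installed):

    from bs4 import BeautifulSoup

    html = """
    <ul>
      <li class="stock"><a href="/AAPL">Apple</a></li>
      <li class="stock"><a href="/GOOG">Google</a></li>
    </ul>
    """

    soup = BeautifulSoup(html, "html.parser")
    for link in soup.select("li.stock a"):  # a CSS selector, jQuery-style
        print(link["href"], link.text)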

Day 10: Flask and Basic Websites

The power of the web comes from dynamic websites, whose pages are generated on the fly by a "web server". Your final project will require a basic website to display your results. These usually consist of three main parts:

  1. A database to store the data
  2. A middle layer of code that handles the "business logic" of the website
  3. HTML which is rendered to the user

Among the many "frameworks" for building web servers, Python's Flask is probably the simplest to learn.
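
The canonical Flask starting point is only a few lines (a minimal sketch):

    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def index():
        return "Hello from Flask!"

    if __name__ == "__main__":
        app.run(port=5000, debug=True)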

Milestone Project: Clone the Flask Demo repository and create your own Flask app on Heroku that accepts a stock ticker input from the user and plots one month of closing price data (a month of your choosing, or perhaps one chosen by the user). You may use any data source you like. One example is the Quandl WIKI dataset, which provides this data for free; you can access its API with Python's Requests library along with simplejson. You can then analyze the data using pandas and plot it using Bokeh. By the end you should have an interactive visualization viewable from the Internet.
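
Fetching the data might look roughly like this (a hedged sketch: the URL pattern, parameters, and JSON layout follow the Quandl v3 API as we recall it, so double-check their docs, and note that you may need an API key):

    import requests

    ticker = "AAPL"  # a hypothetical example ticker
    url = "https://www.quandl.com/api/v3/datasets/WIKI/%s.json" % ticker
    params = {"start_date": "2018-01-01", "end_date": "2018-01-31"}

    response = requests.get(url, params=params)
    dataset = response.json()["dataset"]  # check the docs for the exact layout
    rows = dataset["data"]                # rows like [date, open, high, ...]
    print(rows[0])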

Heroku is a popular cloud application platform, and their documentation is a great resource for understanding how to deploy a simple (free!) server using git.

For more information about Flask, here is a good starting point, especially the links in the "Starting Off" section. This tutorial has great step-by-step explanations.

Note: You may do your development on your local machine or on your remote server, but accessing a running Flask server on the remote system is currently very difficult.

When you finish, you should end up with a project that looks like this. To submit, click the link on your dashboard.

Milestone Deployment Tips

Technologies to Use

  • Git: Your repository can live locally or on the remote server we've provided.
  • Heroku: We will deploy our projects to Heroku. Heroku is a good fit because, as deployment technologies go, it's ridiculously simple to use.
  • Flask: The easiest way to get a web server up and running.

Start Your Web App:

To start, simply create a new directory called 'project/' (or whatever you like) and run git init. Follow this tutorial to get a very basic web app up on Heroku.

From here, you can build your project however you like in the context of web apps:

  • Front-end JS displaying interactive views of static results
  • Predictions based on user input and a model you've built

Simply make sure to include the relevant libraries* in requirements.txt, write any Python code you'll need, wire the code up to an HTTP route/display a view, and you'll be good to go!

*If you plan to use numpy/scipy, you need to take one extra step to ensure Heroku installs those libraries. See here for instructions.

Useful to Know:

On Heroku, you can always change your app's name to something else. You should probably name it something descriptive, e.g. '{my-name}-dataincubator-project'. Instructions can be found here.

Is your app working locally but not on Heroku? It could be that it isn't binding to a port. Lines 10-12 of the code shown in this tutorial will make sure that the app listens on the Heroku-provided port.
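
The pattern looks roughly like this (a sketch of the idea in that tutorial, not its exact code):

    import os

    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def index():
        return "Hello from Heroku!"

    if __name__ == "__main__":
        # Heroku supplies the port to bind to via the PORT environment variable
        port = int(os.environ.get("PORT", 5000))
        app.run(host="0.0.0.0", port=port)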

Brute-force simple steps to deploy the Milestone Project using the example git repo:

  1. Clone this repository, which already defines the dependencies needed to deploy the Flask app on the Heroku platform.
  2. Update the existing app.py with your script for the stock ticker project (the usual Milestone Project).
  3. Add your templates to the templates directory and your style files (.css) to the static directory; you may need to create the static directory yourself.
  4. Follow these steps to deploy the Flask web app on Heroku. Note: Instead of $ heroku create, use:
    $ heroku create --buildpack https://github.com/thedataincubator/conda-buildpack.git#py3

Day 11: Baseline Statistics and Probability 1

At this point, you should learn or brush up on the basics of statistics. We'll cover the more advanced and practical hands-on topics during the bootcamp. The best book on this is Ross, which is conveniently available online. It's written for probabilists but, as a data scientist, you only really need to understand a subset.

You should understand the basics of distributions (mean, median, standard deviation, variance) and a few slightly more advanced concepts (Chapter 3; a worked Bayes' Rule example appears after the list):

  • Independence
  • Conditional Probability
  • Conditional Expectation
  • Bayes' Rule
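
Here is the classic disease-test application of Bayes' Rule, with made-up numbers, as a check on your intuition:

    # A test is 99% sensitive and 95% specific; 1% of the population is sick.
    p_disease = 0.01
    p_pos_given_disease = 0.99
    p_pos_given_healthy = 0.05

    # P(D | +) = P(+ | D) P(D) / P(+), where P(+) sums over sick and healthy
    p_pos = (p_pos_given_disease * p_disease
             + p_pos_given_healthy * (1 - p_disease))
    p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
    print(round(p_disease_given_pos, 3))  # 0.167 -- lower than most people guess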

It's also important to understand these discrete distributions (Chapter 4; a quick simulation check appears after the list):

  • Binomial Distribution (the number of heads in a fixed number of coin flips)
  • Geometric Distribution (how long you have to flip a coin before seeing heads)
  • Poisson Distribution (as a 'counting' distribution)
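
A quick numpy sanity check on those interpretations (the parameters are illustrative):

    import numpy as np

    rng = np.random.RandomState(0)

    # Geometric: flips of a fair coin until the first head (mean 1/p = 2)
    print(rng.geometric(p=0.5, size=100000).mean())

    # Binomial: heads in 10 fair flips (mean n*p = 5)
    print(rng.binomial(n=10, p=0.5, size=100000).mean())

    # Poisson: a 'counting' distribution (mean lambda = 3)
    print(rng.poisson(lam=3, size=100000).mean())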

Action Item: The best way to check whether you understand these concepts is to solve Self-Test Problems 3.1, 3.9, 3.14, 3.22, and 3.29 (pp. 114–116), as well as 4.3, 4.9, 4.14, 4.16, 4.19, and 4.22 (pp. 183–185).

Action Item: Double check your understanding of distributions by reading through this blog post.

Attention: The problems here and in the next section are exactly the kinds of questions that tend to come up in 'stats' job interviews for data science and quant positions. So this is good practice for finding a job. Keep two things in mind:

  1. Math is not a spectator sport! Glancing over problems doesn't mean you know how to solve them. Force yourself to solve them to make sure you know how.
  2. If you really understand things, these problems should not take you more than 5 minutes each (and most of them 2 minutes at most). Get a timer (find one online or use the one on your cell phone) and time yourself as you solve them.

Day 12: Baseline Statistics and Probability 2

Once again, you may not be able to fully assimilate these ideas in two days. Consider this a starting point that can guide your interview practice as you go through the program.

It's important to understand these common continuous distributions (Chapter 5):

  • Normal Distribution (additivity, relationship to the Central Limit Theorem)
  • Exponential Distribution (and the memoryless property)

These topics are worth understanding (Chapter 8):

  • Quantiles
  • Central Limit Theorem
  • Strong Law of Large Numbers

You should be able to explain the difference between the Strong Law of Large Numbers and the Central Limit Theorem.
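
A short simulation makes the distinction concrete (the parameters are made up):

    import numpy as np

    rng = np.random.RandomState(0)
    # 10,000 samples, each of 100 draws from an exponential with mean 1, sd 1
    samples = rng.exponential(scale=1.0, size=(10000, 100))
    means = samples.mean(axis=1)

    # Strong Law of Large Numbers: each sample mean is close to the true mean, 1.0
    print(means[:3])

    # Central Limit Theorem: the distribution of the sample means is roughly
    # normal, with standard deviation sigma / sqrt(n) = 1 / 10 = 0.1
    print(means.std())  # close to 0.1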

Bayes' Rule is an extremely common topic. Make sure you know it in all its forms and know how to apply it to problems.

Read up on the following concepts (a small bootstrap sketch appears after the list):

  • Hypothesis testing, p-values, confidence intervals and how they're related
  • Bootstrapping and jackknifing
  • Maximum likelihood estimation
  • Bayesian estimation
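
As promised, here is a small bootstrap sketch (made-up data; plain numpy, no special libraries):

    import numpy as np

    rng = np.random.RandomState(0)
    data = rng.exponential(scale=2.0, size=500)  # stand-in for observed data

    # Bootstrap: resample with replacement many times and examine the spread
    boot_means = np.array([
        rng.choice(data, size=len(data), replace=True).mean()
        for _ in range(10000)
    ])
    low, high = np.percentile(boot_means, [2.5, 97.5])
    print("95%% bootstrap CI for the mean: (%.2f, %.2f)" % (low, high))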

Action Item: The best way to check whether you understand these concepts is to solve Self-Test Problems 5.8, 5.11, 5.13, 5.17, and 5.18 (pp. 230–231), in addition to 8.5, 8.6, 8.7, 8.11, and 8.13 (pp. 415–416).