How to Prepare for The Data Incubator

At The Data Incubator, we receive thousands of applications to join our data science fellowship. Our admissions bar is very high, and we are often asked:

                        What can I do to prepare for the fellowship application process?

Here are five important skills to develop, along with some resources to help you develop them. While we don’t expect our applicants to possess all of these skills, most applicants already have a strong background in many of them.

  1. Scraping: There’s a lot of data out there, so you’ll need to learn how to get access to it. Whether it’s JSON, HTML, or some homebrew format, you should be able to handle it with ease. Modern scripting languages like Python are ideal for this. In Python, look at packages like urllib2, requests, simplejson, re, and Beautiful Soup to make handling web requests and data formats easier. More advanced topics include error handling (retrying) and parallelization (multiprocessing).
  2. SQL: Once you have a large amount of structured data, you will want to store and process it. SQL is the original query language and its syntax is so prevalent that there are SQL query interfaces for everything from sqldf for R data frames to Hive for MapReduce. Normally, you would have to go through a painful install process to play with SQL. Fortunately, there’s a nice online interactive tutorial available where you can submit your queries and learn interactively. Additionally, Mode Analytics has a great tutorial geared towards data scientists, although it is not interactive. When you’re ready to use SQL locally, SQLite offers a simple-to-install version of SQL.
  3. Data frames: SQL is great for handling large amounts of data but, unfortunately, it lacks machine learning and visualization. So the workflow is often to use SQL or MapReduce to get data to a manageable size and then process it using libraries like R’s data frames or Python’s pandas. For pandas, creator Wes McKinney has a great video tutorial on YouTube. Watch it here and follow along by checking out the GitHub code.
  4. Machine Learning: A lot of data science can be done with select, join, and groupby (or equivalently, map and reduce), but sometimes you need to do some non-trivial machine learning. Before you jump into fancier algorithms, try out simpler algorithms like Naive Bayes and regularized linear regression. In Python, these are implemented in scikit-learn. In R, they are implemented in the glm and gbm libraries. You should make sure you understand the basics really well before trying out fancier algorithms.
  5. Visualization: Data science is about communicating your findings, and data visualization is an incredibly valuable part of that. Python offers MATLAB-like plotting via matplotlib, which is functional even if it’s aesthetically lacking. R offers ggplot, which is prettier. Of course, if you’re really serious about dynamic visualizations, try D3.
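
To make the first two skills concrete, here is a minimal sketch in Python that uses only the standard library. It retries a flaky fetch, parses the JSON that comes back, loads it into an in-memory SQLite database, and aggregates it with a GROUP BY. Everything named here is an assumption made for illustration: the helpers with_retries, make_flaky_fetch, and total_sales_by_city are hypothetical, the sales records are made up, and the "network" is simulated so the example runs offline. A real scraper would fetch the text with something like requests instead.

```python
import json
import sqlite3
import time

# Made-up records standing in for the body of a web response.
RAW_JSON = (
    '[{"city": "Boston", "sales": 120},'
    ' {"city": "Austin", "sales": 80},'
    ' {"city": "Boston", "sales": 30}]'
)


def with_retries(fetch, attempts=3, delay=0.0):
    """Call fetch() until it succeeds or the attempts run out (hypothetical helper)."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:
            # ConnectionError and similar network errors subclass OSError.
            if attempt == attempts:
                raise
            time.sleep(delay)


def make_flaky_fetch():
    """Return a fetch function that fails once, then returns RAW_JSON."""
    calls = {"n": 0}

    def fetch():
        calls["n"] += 1
        if calls["n"] == 1:
            raise ConnectionError("simulated network failure")
        return RAW_JSON

    return fetch


def total_sales_by_city(raw_json):
    """Parse JSON records, store them in SQLite, and sum sales per city."""
    rows = json.loads(raw_json)
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (city TEXT, amount INTEGER)")
        conn.executemany(
            "INSERT INTO sales (city, amount) VALUES (?, ?)",
            [(r["city"], r["sales"]) for r in rows],
        )
        cursor = conn.execute(
            "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    raw = with_retries(make_flaky_fetch())
    print(total_sales_by_city(raw))
```

The same query result could be handed to pandas or a plotting library for the later steps; the point of the sketch is only the shape of the workflow, not a recommended architecture.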

 

These are some of the foundational skills that will be invaluable to your career as a data scientist. While they only cover a subset of what we talk about at The Data Incubator (there’s a lot more to cover in stats, machine learning, and MapReduce), this is a great start. And when you’re ready, apply for the fellowship!

 

This article was originally published as a blog post on Software Carpentry, which can be found here.

 
