Succeed with our Data Prep Materials
Everyone starts out at different levels when they enter TDI’s data science or data engineering programs. To help get everyone on the same page, we’ve put together these prep materials to ensure you feel fully prepared from day one of your program.
Past alumni have said that they were able to get more out of their programs because they took the time to practice these skills before starting the bootcamps.
Below, we’ve put together five modules that will give you a taste of what our programs are like and provide you with an understanding of what you should know before starting your data journey in our bootcamp.
Each of these modules should take about five hours to work through, give or take depending on your past experience.
Make sure you look at each module’s Action Items and the accompanying materials. Take deeper dives into the topics you’re not familiar with, and make sure you’re set up for success across all the modules.
Warning!
When working through a coding tutorial, it can be really tempting to copy and paste code from the tutorial directly into the Python shell and think you understand it.
Don’t do this! Type the code in yourself, character by character: this forces you to learn the commands so that when you need to actually use them, you’ll have them closer to your fingertips.
Things to note: These are the first five modules of our 12-module pre-course plan. When you’re accepted to the program, you’ll see these prep materials, along with seven other modules, to help you succeed in the program.
Python is a great language for data analytics. It offers a lot of the tools available in languages like MATLAB and R. Unlike MATLAB, it's free. Unlike both of those languages, it promotes good coding practices. But more importantly (for when you start working), it's a real engineering language that makes it easy for you to:
- Collect data from existing databases using the tools that are currently available in your company.
- Integrate your code and contributions into the rest of the codebase that your company will use.
You are already familiar with programming; you just need to learn Python’s syntax (if you don't know it already) and the numerical and scientific tools available. We are using Python 3.
You will use a prepared Python environment on a cloud server for our program, but you will likely find it useful to have a Python environment on your local machine as well. If you don't have one installed, we recommend the Anaconda distribution. It's free for personal use, and it comes with many of the packages we will use already installed.
Resources:
- A decent online (and free!) book is Think Python by Allen B. Downey. Features of Python particularly important to data science include (the short sketch after this list demonstrates several of them):
  - Dictionaries (and converting back and forth between other data structures)
  - List comprehensions
  - Generators (as opposed to lists)
  - Lambda functions
  - Classes
  - Decorators
  - Variable numbers of function arguments
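Here is a minimal sketch of several of these features side by side, using made-up word-count data:

```python
# Made-up word-count data, for illustration only.
word_counts = {"data": 3, "science": 2, "python": 5}   # a dictionary

# Converting between data structures: dict -> list of tuples -> dict.
pairs = list(word_counts.items())
back_to_dict = dict(pairs)

# A list comprehension builds the whole list in memory at once...
frequent = [word for word, count in word_counts.items() if count > 2]

# ...while a generator expression produces values lazily, one at a time.
lazy_frequent = (word for word, count in word_counts.items() if count > 2)

# A lambda function is handy for short, throwaway logic, e.g. as a sort key.
by_count = sorted(word_counts, key=lambda word: word_counts[word], reverse=True)

print(frequent)              # ['data', 'python']
print(list(lazy_frequent))   # ['data', 'python'], computed on demand
print(by_count)              # ['python', 'data', 'science']
```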
Action Items
- Get Python running on your local machine. If you don't know where to start, use the Anaconda distribution.
- To get started with learning Python syntax, we suggest beginning with either the Learn Python tutorial or this tutorial on the official Python website.
- Once you’ve gone through the above short tutorials, go to Project Euler and use Python to solve at least 5 problems. Try to choose problems that allow you to practice using dictionaries, list comprehensions, and other Pythonic features.
Now that you've spent time learning Python, continue to improve your proficiency with the language. Spend this day getting familiar with the extensive Python standard library and reinforcing any knowledge gaps you encountered the previous day. There's no problem with spending more time on the material of the first two days; it's really important that you start the program comfortable programming in Python.
Resources
- The following are important modules that are part of the Python standard library. It's good to be aware of these before the Fellowship program starts. (A short sketch after the list shows a few of them in action.)
  - `collections`: While you can accomplish much with Python's built-in data structures such as `list`, `tuple`, `dict`, and `set`, the `collections` module provides some specialized data structures that make your life more convenient. A few to know about are `Counter`, `namedtuple`, and `defaultdict`.
  - `itertools`: There are times when you need to come up with all possible permutations or combinations and iterate over them. While you might be able to programmatically create them and store them in a list, that would be memory inefficient, slow, and prone to error. Instead, consider using the tools offered by `itertools`. In general, the module provides a set of fast and memory-efficient iterators. A few iterators of note are `permutations`, `combinations`, and `product`.
  - `os`, `glob`, and `pathlib`: You may find yourself needing to do operating system tasks like creating and removing directories. The tools in the `os` module help make your code compatible across different operating systems. If you are looking for functions to work with paths, look into `os.path`; for example, `os.path.join` joins paths intelligently, using the correct directory separator for your operating system. The `glob` module lets you find all file paths matching a given pattern; for example, you can find all files in the "data" directory that end in `.csv` using the pattern `data/*.csv`. `pathlib` is an alternative to `os.path` and `glob`: instead of representing paths and files as strings, it provides a `Path` class, offering an object-oriented way to work with paths. A `Path` object, for example, has methods to check whether a file exists and to iterate over all subdirectories.
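Here's a minimal sketch of these modules in action; the `data` directory in the last example is a made-up placeholder, so adjust the path for your own machine:

```python
from collections import Counter, defaultdict, namedtuple
from itertools import product
from pathlib import Path

# Counter: tally items without writing the counting loop yourself.
colors = ["red", "blue", "red", "green", "red"]
print(Counter(colors).most_common(1))        # [('red', 3)]

# defaultdict: missing keys get a default value instead of raising KeyError.
groups = defaultdict(list)
groups["vowels"].append("a")

# namedtuple: lightweight records with named fields.
Point = namedtuple("Point", ["x", "y"])
p = Point(x=1, y=2)
print(p.x + p.y)                             # 3

# itertools.product: iterate over a cartesian product lazily,
# without building the whole list in memory first.
for suit, rank in product(["hearts", "spades"], ["J", "Q", "K"]):
    print(suit, rank)

# pathlib: paths as objects, not strings. Assumes a "data" directory exists.
for csv_file in Path("data").glob("*.csv"):
    print(csv_file.name)
```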
Action Items
- Continue with the Learn Python tutorial or this tutorial on the official Python website.
- Assess which areas of Python you feel least comfortable with. Create an account on HackerRank, a platform for taking coding challenges. Under the Python section, look for challenges that target those areas. HackerRank even has a Python skills assessment test you can take to better understand which areas you need to work on.
Python has a wonderful suite of libraries for doing scientific computing. The two main packages are `numpy` and `scipy`.
What is NumPy, and why is it relevant for scientific computing? NumPy is a Python package that provides, among other things, a multi-dimensional array object, the `ndarray`, which allows for very fast array operations. NumPy is written in the C programming language, allowing the library to leverage the speed of C for fast vectorized operations. Rather than writing slow `for` loops, you should vectorize your operations using NumPy wherever possible.
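For instance, here's a minimal sketch contrasting an explicit Python loop with the equivalent vectorized operation (the array size is arbitrary):

```python
import numpy as np

values = np.arange(1_000_000)

# Slow: an explicit Python loop over every element.
squares_loop = np.empty_like(values)
for i in range(len(values)):
    squares_loop[i] = values[i] ** 2

# Fast: one vectorized operation, with the loop happening in C.
squares_vec = values ** 2

assert np.array_equal(squares_loop, squares_vec)
```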
NumPy also forms the basis of, or is used extensively by, other packages such as `scipy` and `scikit-learn`, so a firm understanding of NumPy is useful for working with those packages as well.
`scipy` is a package that provides many efficient numerical routines, such as numerical integration, optimization, linear algebra, and statistical methods. While useful for scientific computing, we won't use `scipy` much in the Fellowship program.
Resources
- An excellent tutorial on NumPy.
- If you like videos, try watching this one and following along with their GitHub code.
- For those of you familiar with MATLAB, here’s a guide for translating your MATLAB knowledge into NumPy.
Action Items
- Make certain that you know how to create an array in `numpy`. There are many different ways to do this, such as using the `array` function, `zeros` and `ones` for pre-filled arrays of a given size, and the `arange` and `linspace` functions for sequences. (See the sketch after this list.)
- Create some NumPy arrays, manipulate them with basic operations, and try out some of the `numpy` universal functions.
- Do the one-star exercises and as many of the two-star exercises as you can from the NumPy 100 exercises page.
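As a quick reference, here's a short sketch of those creation functions alongside a couple of universal functions:

```python
import numpy as np

# Several ways to create arrays:
a = np.array([1, 2, 3])         # from a Python list
z = np.zeros((2, 3))            # 2x3 array of zeros
o = np.ones(4)                  # length-4 array of ones
r = np.arange(0, 10, 2)         # [0 2 4 6 8] -- like range()
l = np.linspace(0.0, 1.0, 5)    # 5 evenly spaced points from 0 to 1

# Basic operations are elementwise and vectorized...
print(a + 10)                   # [11 12 13]
print(a * a)                    # [1 4 9]

# ...and "universal functions" (ufuncs) also apply elementwise.
print(np.sqrt(a))
print(np.exp(l))
```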
`pandas` is the Python package for data analysis and manipulation. It's designed to make working with relational (or tabular) data fast and easy. The package provides many methods for performing data analysis, manipulation, and aggregation, as well as many visualization tools (largely built on top of the `matplotlib` library). If you are familiar with R, you'll notice similarities with R's `data.frame`.

`pandas` is built on top of the `numpy` package, so a good understanding of `numpy` is valuable for working easily in `pandas`. In turn, `pandas` is designed to integrate well with other Python libraries such as `scikit-learn` and `statsmodels`. It also has great capabilities for working with time series data, including a large suite of time-series-specific methods.
Resources
- The homepage of pandas is a great place to start, as are the Getting started tutorials. Important things to know include (the sketch after this list walks through each):
  - How to read data from a CSV (comma-separated value) file to create a DataFrame.
  - How to filter data in a DataFrame.
  - How to compute summary statistics for a DataFrame.
  - How to use the `groupby` method to aggregate data.
  - How to dump results of an analysis to a CSV file.
- There are several nice pandas tutorial videos on YouTube. Here is a playlist of a series we liked. It's broken up by topic so feel free to skip around and focus on areas you need to strengthen.
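Here's a minimal sketch of that workflow. The file and column names are made up for illustration; the real data set in the action items below has its own schema:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("businesses.csv")          # read a CSV into a DataFrame

# Filter rows: keep only businesses in one borough.
queens = df[df["borough"] == "Queens"]

# Summary statistics for the numeric columns.
print(df.describe())

# Group and aggregate: average savings and total jobs per borough.
summary = df.groupby("borough").agg(
    avg_savings=("total_savings", "mean"),
    total_jobs=("jobs_created", "sum"),
)

# Dump the result back out to a CSV file.
summary.to_csv("summary.csv")
```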
Action Items
Download the data set about the Value of Energy Cost Savings Program for businesses in New York City (under the "Export" option, there is a way to retrieve a CSV file), then answer the following questions.
- How many different companies are represented in the data set?
- What is the total number of jobs created for businesses in Queens?
- How many unique email domain names are there in the data set?
- Considering only NTAs with at least 5 listed businesses, what is the average total savings and the total jobs created for each NTA?
- Save your result for the previous question as a CSV file.
Matplotlib is Python's most popular plotting package. It provides an object-oriented programming API and a procedural API similar to that of MATLAB. While much can be done with Matplotlib, for both nicer looking and statistically focused visualization, consider using Seaborn, which extends and builds on Matplotlib.
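Here's a minimal sketch of the object-oriented API using made-up data, including the logarithmic axis that the action items below call for:

```python
import numpy as np
import matplotlib.pyplot as plt

# Some made-up data to plot.
x = np.linspace(0, 10, 100)
y = np.exp(x / 2)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.scatter(x, y, s=5)
ax1.set_title("standard scale")

ax2.scatter(x, y, s=5)
ax2.set_yscale("log")            # same data, logarithmic y-axis
ax2.set_title("log scale")

plt.tight_layout()
plt.show()
```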
With these plotting libraries, don't focus on trying to memorize the entire API. A good starting point is to visit the examples page and look for visualizations similar to the one you are trying to build.

Altair and Bokeh are two Python packages for creating interactive visualizations. In the Fellowship program, we'll focus on using Altair.
Action Items
- Go through the Introductory Tutorials on Matplotlib.
- Using the same data set and results that you were working with in the pandas action items section (Day 4), create a
- scatter plot of jobs created versus average savings. Use both a standard and a logarithmic scale for the average savings.
- histogram of the log of the average total savings.
- line plot of the total jobs created for each month.
- If you have time, take a look at this short tutorial on Altair.
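To give you a taste of Altair's declarative style, here's a minimal sketch with made-up data:

```python
import altair as alt
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "jobs": [12, 30, 22]})

# Altair builds charts declaratively: data -> mark -> encodings.
chart = (
    alt.Chart(df)
    .mark_bar()
    .encode(x="month", y="jobs", tooltip=["month", "jobs"])
)

chart.save("jobs.html")   # or simply display `chart` in a Jupyter notebook
```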
You did it! Congratulations on completing the prep materials and taking the first step in preparing for your program.
How was it?
If you struggled, that’s ok! People at many different skill levels apply to and get accepted into our bootcamps, so you’re always welcome to apply.
Now that you’ve got this under your belt, take a peek at what our admissions process looks like past this point and see what other steps you might need to take.
Not ready for a bootcamp? We’ve got you covered. Take our “data 101” course, Data Science Essentials, to whip your data basics into shape. You’ll strengthen and build upon your current data skills and improve your Python mastery. You’ll also get automatic entry into our next bootcamp when you graduate from Essentials.
Need More Info? Talk with Admissions!
- Learn more about the application process from the admissions team
- Learn about all our course options and the differences between them
- Determine the best financing options for your needs
- Get the answers to all your questions
Meet the Instructors
Robert Schroll
Don Fox
Ana Hocevar
Russell Martin
Richard Ott
Nicholas Dela Fuente
Meet Alumni Who've Been in Your Shoes
TDI gave me my first insight into data science in the business world. Thoroughly designed mini-projects are excellent stepping stones to tackling real-world problems. A good amount of well-sorted study materials and meticulously designed notebooks are always there to help you to excel in your job interviews.
TDI introduced me to a wide range of material and gave me the confidence to pursue a Data Science career. Particularly with my Capstone project, I tailored it to the Healthcare and Marketing industry. My project presentation in my interview directly led to me beating out other applicants with more experience than myself. The experience and feedback received via the Capstone Project was instrumental in getting me the job I have now.
TDI provided me with a great introduction to the platforms and technologies in ML used in the industry. They selected projects and exercises emulating a real ML work environment. Working side by side with instructors and fellows was a crucial part of the learning experience. It also helped to build connections that I still value greatly.
Meet Alumni Who've Transitioned From Academia
“I’m a recovering academic,” says Dr. Andrew Grazyk. He was one of the few who managed to stay in academia for a couple of years before making the switch to industry. He completed his PhD in Economics in 2017 before applying to TDI in 2019, when his research projects dried up and his ambitions changed. He was stuck—he needed to learn new computational techniques to break into industry, but didn’t have the structure and guidance to gain them. Until The Data Incubator popped up in a Google search. He credits our instructors as the key component in the development of his skills.
For Newton Le, “TDI was my way out.” Newton found himself without opportunities for research, which left him unable to secure an academic position. The data science fellowship was just what he needed to move out of research and into the computational problem-solving he enjoyed. He landed his first job with Crunchbase through our hiring partner program after he joined TDI in 2016. Within months of graduation, he was working as a data engineer, and he is now the tech lead for the machine learning pipelines team at Twitter—where he’s been since 2017 and calls his journey “sorta unbelievable.”
Hechen Ren remembers watching many of her peers’ research ride trends that her field in physics just didn’t allow. The oversaturation of her postdoc field didn’t do her any favors either. So in 2019, after chatting with friends in industry, she was convinced that TDI was going to bring her the success she was looking for. “It was important for me to find something that struck a balance between value and passion,” she says. She found that balance by applying the statistical modeling skills she learned at TDI in the research and development department at Afiniti, where she’s been since graduating.
FAQ
TDI isn’t your typical program. We work hard to ensure that we offer the best training from the best instructors, using the latest tools and real-world data so you can feel confident when you step into your new job in data.
- Cohort-style program. Make the transition from academia to the business world, or enhance your data skills while you work with an excited and exceptional peer group. We aim to keep our cohorts intimate, to maximize your interaction with instructors.
- Hands-on experience. All of the projects you complete in the program are designed to give you experience with real data sets while solving real business problems.
- Accessible instructors. Every section has a dedicated data instructor to lead discussions and assist students.
- Mentorship from industry leaders. Learn from alumni and senior data scientists and data analysts while you build your professional network.
- A large network of hiring partners. We work with a number of actively hiring partner companies each cohort to help find you your dream career—and not just another job.
To apply to the Data Science or the Data Engineering programs, you’ll need to possess these requirements:
- A master’s degree or PhD completed before the beginning of the program, or a PhD that will be completed within 3 months of finishing the program, or a Bachelor’s degree with extensive data-related work experience
- Some prior programming experience
- A foundational knowledge of statistics for Data Science or computer science for Data Engineering
- Proficiency in the English language
If you’ve got all that, then you’re eligible to apply for the Data Science or the Data Engineering program.
To put it plainly, yes. We require all applicants to be familiar with at least one programming language. Applicants to the Data Science and Data Engineering programs must also take a coding challenge before being accepted into the program.
While the programs emphasize Python, many of our applicants have experience with other languages, and have varying degrees of coding experience.
- Our Data Science applicants usually have a strong background in probability and statistics, along with experience using programming, scripting, or statistical packages.
- We’ve had successful applicants with degrees in political science, economics, computer science, mathematics, physics, chemistry, and more.
- Our Data Engineering applicants should have a strong background in computer science and software development, along with programming experience.
- Data Science Essentials is meant for students who are looking to learn essential data science skills: those who may be new to Python and want a more structured learning program, those interested in learning the fundamentals of quantitative analysis, or those looking to apply to the Data Science Fellowship.
To attend our full-time or part-time online Data Science Program as a scholar, the tuition is: $11,000
To attend our full-time or part-time online Data Science & Engineering Program the tuition is: $10,000
For those paying in cash for the Data Science or the Data Engineering programs, we offer two discount options:
- $2,000 for those who apply during our early admissions time frame
- $1,000 for those who apply during our regular admissions time frame
Financing is available through Ascent or other lenders of your choice.
We also offer income sharing, where you don’t pay anything for tuition until you start your career and are earning over a certain threshold. Income sharing is an agreement between you, The Data Incubator, and our partner, Leif. Check out the details under the question “What is income sharing?”
If you’re placed with one of our hiring partners after your training, we’ll refund 30% of your tuition*.
A limited number of scholarships are available for the full-time data science and engineering programs. Check out the details under the question “Do you offer scholarships?” to learn more.
We encourage you to put your best foot forward during your application for a chance to earn one of these coveted scholarships.
*Participants who use Ascent or Income Sharing Agreements to finance their participation are not eligible for the 30% placement refund.