At The Data Incubator we run a free eight-week Data Science Fellowship Program to help our Fellows land industry jobs. We love Fellows with diverse academic backgrounds that go beyond what companies traditionally think of when hiring Data Scientists. William was a Fellow in our winter cohort who landed a job with one of our hiring partners, Dataminr.
Tell us about your background. How did it set you up to be a great Data Scientist?
What do you think you got out of The Data Incubator?
Also, I especially enjoyed participating as part of the cohort in the NYC Data Incubator. Being located in NYC and interacting with the other Fellows was very valuable. It helped me to understand my own strengths relative to others looking for similar opportunities. Coming to a new city as a group, all at once and with similar goals, helped to turn what is usually an individual effort into a more collaborative process. The connections formed within the group will help us far into our future careers.
What advice would you give to someone who is applying for The Data Incubator, particularly someone with a computer science background?
First, practicing fundamental computer science problems beyond one’s research area is critically important. In interviews, these kinds of questions are ubiquitous, so whiteboard coding of basic algorithms (sorting, search), data structures, and statistics is always going to be helpful for future job searching. There are many books and websites about this topic, but there is no replacement for the practice time!
Second, having a portfolio of one or two items from recent past efforts is valuable when trying to demonstrate your value to a potential future employer. It is really much more impactful to show a future boss what you have done than it is to tell them.
Third, being able to work quickly on a new problem or dataset to develop actionable insights is a very valuable skill for data scientists. It’s important to become comfortable with a specific tool-set and to rapidly produce statistics, models, and visualizations that can be used to explain the underlying phenomena to others. I think it is important to practice digesting new datasets from many domains from scratch without any specific goal. Being inquisitive is an extremely valuable habit in this business. [Editor’s Note: for more information on preparing to be a Data Scientist, check out this previous blog post.]
Could you tell us about your Data Incubator taxi project? It was really cool; we’d love to feature it.
My Data Incubator project considers the following problem: From my current location on the street in NYC, what is the best nearby location I should walk toward to maximize my chance of finding a taxi quickly? Should I walk west or east? Should I walk two more blocks to improve my chances? My project, nyctaxi.me, is a smartphone-friendly web app that answers these questions for you. Using your current GPS location, the current time of day, and a large corpus of historical data, the site determines the seven best nearby intersections that you should walk toward to minimize your taxi hail wait. It also estimates the total walking and waiting time for each of the potential locations, and ranks these locations by total walking plus waiting time. I encourage everyone to try the site. I found this was a great ice-breaker in my interactions with companies because it solves a problem that many long-time New Yorkers as well as newcomers face. Additionally, I benefitted from using this taxi hot-spot finder a couple of times as a Data Incubator Fellow on my way to events!
How does it work? At the risk of giving too much detail, I will describe some specifics. The dataset I used was a collection of all yellow taxi fares from 2013 (starting GPS location, ending GPS location, total distance, fare, tolls, passenger count, etc.), which was publicly released by the NYC Taxi and Limousine Commission. I map each of the fare end points to the nearest road intersection to facilitate generalization and reduce noise in the location data. Using GIS data on the road network, I find there are over 100,000 road intersections within the city. Then I characterize the taxi drop-off and pickup frequency at each road intersection by estimating these event frequencies conditioned on the time of day and the day of the week. These pickup and drop-off frequencies appear in the nyctaxi.me map visualization as colored dots, one centered over each intersection. Given the user’s current GPS location, we compute the walking distance to all nearby intersections using the “Manhattan” distance. From the estimated frequencies, we also compute the average expected waiting time (half of the mean inter-arrival time) at each nearby intersection for the current time. Finally, we rank the nearest intersections from best to worst based on total walking plus expected waiting time.
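To make the ranking step concrete, here is a minimal sketch in Python. It is not the nyctaxi.me code. The `Intersection` class, the `arrivals_per_hour` field, the function names, and the walking speed of 80 meters per minute are all assumptions made for illustration. The Manhattan distance is measured along north-south and east-west axes, which ignores the rotation of the actual street grid.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0
WALK_SPEED_M_PER_MIN = 80.0  # assumed walking speed, not from the project


@dataclass(frozen=True)
class Intersection:
    """Hypothetical record for one road intersection."""
    name: str
    lat: float
    lon: float
    # Assumed input: estimated taxi arrival rate for the current
    # hour and weekday, derived from the historical frequencies.
    arrivals_per_hour: float


def manhattan_distance_m(lat1, lon1, lat2, lon2):
    """Approximate L1 distance in meters along N-S and E-W axes."""
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    north_south = math.radians(abs(lat2 - lat1)) * EARTH_RADIUS_M
    east_west = (math.radians(abs(lon2 - lon1))
                 * EARTH_RADIUS_M * math.cos(mean_lat))
    return north_south + east_west


def expected_wait_min(arrivals_per_hour):
    """Half of the mean inter-arrival time, in minutes."""
    if arrivals_per_hour <= 0:
        return math.inf  # no recorded arrivals: never recommend
    mean_gap_min = 60.0 / arrivals_per_hour
    return mean_gap_min / 2.0


def rank_intersections(user_lat, user_lon, candidates, top_n=7):
    """Return the top_n candidates sorted by walking plus waiting time.

    Each result is (total_min, walk_min, wait_min, intersection).
    """
    scored = []
    for c in candidates:
        walk = (manhattan_distance_m(user_lat, user_lon, c.lat, c.lon)
                / WALK_SPEED_M_PER_MIN)
        wait = expected_wait_min(c.arrivals_per_hour)
        scored.append((walk + wait, walk, wait, c))
    scored.sort(key=lambda item: item[0])
    return scored[:top_n]
```

Calling `rank_intersections(lat, lon, nearby)` with a list of nearby `Intersection` objects returns up to seven entries, best first. A quiet corner next to you can lose to a busy one a block away, because the shorter wait outweighs the extra walk.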
This dataset (about 50GB in size, consisting of over 170 million taxi fares) has been used in previous visualization and analysis projects, and even for academic research. But I feel my project is unique because it helps to solve a real problem that people face every day.