Matthew was a Fellow in our Fall 2015 cohort who landed a job with one of our hiring partners, AdTheorent.
Tell us about your background. How did it set you up to be a great Data Scientist?
I was a computational experimentalist in high-energy nuclear physics, working at the RHIC facility at Brookhaven National Lab on Long Island. The work involved mining multiple petabytes of data with distributed computing to extract small signals on the order of hundreds of gigabytes, which meant cleaning the data and developing algorithms to isolate those signals. Once analyzed, the results were written up and published in peer-reviewed journals.
What do you think you got out of The Data Incubator?
I was very happy with the Data Incubator course. I was introduced to many concepts that I had only touched on in some online courses I had taken beforehand. The weekly mini-projects forced me to understand the material more deeply. While there were lectures to accompany the mini-projects, completing them required doing some research on my own, which was valuable. We weren't just fed the answers.
On top of this, as an online Fellow, I appreciated being split into small groups. In my case, this worked really well, and I would encourage others in the same situation to use it to their advantage and hold regular meetings.
I feel that I have made some good friends among the other Fellows, friendships I will carry with me as I start my career as a Data Scientist in industry.
What advice would you give to someone who is applying for The Data Incubator, particularly someone with your background?
Before you enter the Data Incubator, there is a 12-day onboarding checklist. Take it seriously and give yourself time to work on the things you don't know (whether that's SQL, graph algorithms, or something else); it will be worthwhile in the end.
Spend some time on your capstone project before the course starts. It is an important project that you can use to showcase your skills, and it often comes up in job interviews. The more time you put in early, the more impressive your project will be and the more time you will have for the other course requirements. A number of employers told me they really liked the projects that had web interfaces, so try to incorporate one into your project rather than just linking to an IPython notebook on your GitHub account.
If you are an online Fellow, don't treat it as a course that you can complete by spending a couple of hours a night on it. That approach is likely to fail, and even if you somehow pass the mini-projects, you won't get as much out of the course as the people who treat it as a full-time job.
Mini-projects are due on Saturday evenings. Don't leave them to the last minute, especially the ones using AWS, as the queues will often be full with other Fellows trying to finish their projects. Start early so that you can finish early and prepare for the following week. [Editor’s note: We no longer have one shared AWS account for everybody; we encourage people to use their own to avoid queuing problems.]
What is your favorite thing you learned at The Data Incubator?
It is difficult to pick just one topic out of the many covered in the course, but I found the use of MRJob and Hadoop particularly interesting and perhaps the most relevant to my future career.
I also really liked the last project, which used Spark and Scala. It was new to me and a bit daunting at first, but by the end of the week it was obvious how useful Spark can be, and how much more concise the Scala code is than the MRJob code needed to accomplish the same task.
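[Editor’s note: for readers unfamiliar with the pattern, here is a minimal standard-library sketch of the map/shuffle/reduce steps that MRJob wraps for you. This is illustrative only — a real MRJob job subclasses `mrjob.job.MRJob` and runs on Hadoop, while Spark expresses the same word count in roughly one line of Scala.]

```python
from itertools import groupby
from operator import itemgetter

# Illustrative stdlib-only sketch of the MapReduce word-count pattern.
# A real MRJob job would define these as methods on an MRJob subclass.

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word on the line.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: sum the counts emitted for one word.
    yield word, sum(counts)

def word_count(lines):
    # Map: collect all (word, 1) pairs.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle: sort and group pairs by key, as the framework would.
    pairs.sort(key=itemgetter(0))
    result = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for key, total in reducer(word, (count for _, count in group)):
            result[key] = total
    return result

print(word_count(["to be or not to be"]))
# → {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```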
Could you tell us about your Data Incubator Capstone project?
For my project, I analyzed data on calls to 311 non-emergency numbers in different cities, split into categories and by location based on latitude and longitude. I was able to visualize the time dependence of these requests and then build a model to predict the category of call being made, which could help cities streamline their responses to certain requests.
To improve these predictions, I scraped weather data from the Weather Underground website and incorporated it into the model. I also created a website which people can use to visualize historical data and make future predictions.
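[Editor’s note: one common way to fold weather into such a model is to join daily observations onto each call record by date, so the weather values become extra model features. The sketch below shows that join using only the standard library; the field names and sample values are assumptions for illustration, not Matthew’s actual data.]

```python
from datetime import date

# Hypothetical sample data: 311 call records and daily weather observations.
calls = [
    {"date": date(2015, 1, 10), "category": "heating", "lat": 40.71, "lon": -74.01},
    {"date": date(2015, 1, 11), "category": "pothole", "lat": 40.73, "lon": -73.99},
]
weather = {
    date(2015, 1, 10): {"high_f": 28, "precip_in": 0.0},
    date(2015, 1, 11): {"high_f": 35, "precip_in": 0.2},
}

def add_weather_features(calls, weather):
    """Return call records augmented with that day's weather fields."""
    enriched = []
    for call in calls:
        row = dict(call)  # copy so the original record is untouched
        row.update(weather.get(call["date"], {}))
        enriched.append(row)
    return enriched

for row in add_weather_features(calls, weather):
    print(row["category"], row["high_f"], row["precip_in"])
```

The enriched rows can then be fed to any classifier that predicts the call category from location, time, and weather.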