Andrew was a Fellow in our Fall 2015 cohort who landed a job with one of our hiring partners in Washington, DC.
Tell us about your background. How did it set you up to be a great Data Scientist?
I’m an experimental nuclear physicist by training. I had the great privilege to perform research at the National Institute of Standards and Technology (NIST) for nine years. NIST is a Department of Commerce laboratory that specializes in the science of measurement (metrology) and its application to industry. My research focused on precision measurement techniques with neutrons to advance our understanding of fundamental physics and to improve industry services offered by my group.
There are two things that I think have helped me get to where I am:
1) Like most physicists, I think I have a natural propensity to tinker with things well outside my expertise. Taken too far, this can be a bad thing. But, applied appropriately, it’s exactly the kind of attitude needed to learn and keep up with the ever-changing field of data science.
2) Having focused on precision measurements in my research, I’ve seen time and time again how much the environment in which I performed my experiments impacted the data and informed my analysis. The parallel to data science is that my training has taught me that a deep understanding of the problem, and of how the data were collected, is what allows you to ask the right questions and produce meaningful results.
What do you think you got out of The Data Incubator?
Entering The Data Incubator, I knew I’d be getting a solid foundation on current data science tools, as well as the opportunity to carry out a data science project of my own making. But I ended up with much more than that. Here are three things that come to mind:
1) An appreciation for how the skills I’ve developed as a physicist are applicable to problems in industry.
2) The opportunity to take on challenging problems alongside Fellows from a range of academic and technical backgrounds.
3) The beginnings of a very solid professional network in data science composed of Fellows in my cohort, program alumni, Data Incubator staff, and the numerous hiring partners we interacted with.
What advice would you give to someone who is applying for The Data Incubator, particularly someone with your background?
I feel that a physicist will either already have, or can demonstrate the ability to learn, the technical tools needed to become a data scientist. Where we, as a field, sometimes fall short is in explaining our work to people who are not subject matter experts. So I would offer a gentle reminder that as a data scientist, your ability to explain your work will be just as important as your ability to do the work. An applicant who shows the ability to empathize with an audience and deliver clear, concise messages will have an edge.
I would also recommend that you start working with Python, if you haven’t already. You’ll be using it daily at the Incubator, and you don’t want your understanding of the language to get in the way of all the techniques you’ll be learning.
What is your favorite thing you learned at The Data Incubator?
A couple of things:
1) I learned to make the Jupyter Notebook a big part of my workflow. It’s a fantastic tool for things ranging from messy exploratory work to analyses that will be shared with others.
2) I learned that Slack is a wonderful tool for working with a distributed team. Just be careful not to abuse the secret commands too much!
Could you tell us about your Data Incubator Capstone project?
Every day, hundreds of thousands of people rely on Washington, DC’s Metrorail to get them to their destinations in a timely and safe manner. But in the most recently reported quarter (Q3 CY15), on-time performance dipped below 80%. While there is an official system for alerting riders about delays, it does not always provide actionable advice, nor are there any guarantees of accuracy. In the preliminary study for my Data Incubator project, I found that, in one week, the official alert system was completely silent for two weekday afternoons, missing a total of 33 incidents that were eventually acknowledged in the official daily service reports. This left me asking whether there was another way to find delays.
It turns out that Metro riders are a very vocal bunch on social media, especially when it comes to service issues. My project used this data to make predictions about delays. I gathered Tweets using a set of searches on hashtags and user account mentions, and scraped the daily service report archive to generate a training set. From the Tweets, I constructed a feature array containing temporal information, Tweet volume, and quantities related to the appearance of certain keywords in the Tweets (e.g. “delay” or “offloading”). This data was used to train a Random Forest classifier, which was then deployed to a web application. You can check it out and learn more about how the classifier was trained here.
Learn more about our offerings:
- Find out which program is best for you – will it be our Data Science Fellowship, our Data Analytics Program or our Data Science Essentials Course?
- Hiring Data Scientists
- Corporate data science training