Brendan was a Fellow in our Fall 2015 cohort who landed a job with one of our hiring partners, Jolata.
Tell us about your background. How did it set you up to be a great Data Scientist?
I did my PhD research in theoretical condensed matter physics at the University of California, Santa Barbara. My research focused on the phase diagram of chains of non-abelian anyons. Because such chains are gapless in most regions of the phase diagram, we had to model them using very large matrices in C++. To make the computation more tractable, we used hash tables and sparse matrices. Besides my background in numerics, I also took the time to learn Python, Pandas, SQL, and MapReduce in Cloudera a few months before starting the fellowship.
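The hash-table trick is to store only a matrix's non-zero entries, so memory scales with the number of non-zeros rather than with rows times columns. A minimal sketch in Python (the original work was in C++, and the matrix values here are invented for illustration):

```python
def sparse_matvec(entries, vec):
    """Compute y = A @ x where A is stored as a hash table
    mapping (row, col) -> value, keeping only non-zero entries."""
    y = [0.0] * len(vec)
    for (i, j), a_ij in entries.items():
        y[i] += a_ij * vec[j]
    return y

# A 3x3 matrix with only three non-zero entries:
A = {(0, 0): 2.0, (1, 2): -1.0, (2, 1): 3.0}
x = [1.0, 2.0, 3.0]
print(sparse_matvec(A, x))  # [2.0, -3.0, 6.0]
```

Production codes typically use compressed formats such as CSR instead of a raw hash table, but the idea of skipping the zeros is the same.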
What do you think you got out of The Data Incubator?
What advice would you give to someone who is applying for The Data Incubator, particularly someone with your background?
Get familiar with SQL, Python, and machine learning well before applying to the program. Also, get familiar with analyzing text data and learn how to deal with Unicode.
What is your favorite thing you learned at The Data Incubator?
Learning MapReduce and Spark on clusters was particularly useful. There are subtle differences between running your code on a single node and running it on a cluster, and these are important to know. The miniprojects are especially useful when talking to employers, because they are typically looking for someone with background knowledge covered in at least one of the miniprojects (in my case, recommender systems), which may not have been covered in the capstone project.
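One example of such a subtle difference: a distributed reduce first combines values within each partition and then merges the partial results, so an operation that is not associative silently gives answers that depend on how the data happened to be partitioned. A plain-Python sketch that mimics this behavior (illustrative only, not the Spark API):

```python
from functools import reduce

def cluster_reduce(data, op, n_partitions):
    """Mimic a distributed reduce: each partition reduces its own
    slice locally, then the partial results are merged on the driver."""
    size = -(-len(data) // n_partitions)  # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(op, p) for p in partitions]
    return reduce(op, partials)

data = list(range(1, 9))

# Addition is associative: single-node and "cluster" results agree.
print(cluster_reduce(data, lambda a, b: a + b, 1))  # 36
print(cluster_reduce(data, lambda a, b: a + b, 4))  # 36

# Subtraction is not: the answer now depends on the partitioning.
print(cluster_reduce(data, lambda a, b: a - b, 1))  # -34
print(cluster_reduce(data, lambda a, b: a - b, 4))  # 2
```

This is why frameworks like Spark require the function passed to `reduce` to be associative and commutative: code that looks correct on a single node can quietly produce different results once the data is split across executors.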
Could you tell us about your Data Incubator Capstone project?
There’s been a lot of hype around sensor data and how it could be used in wearable devices or smart cities. The aim of my project was much more modest. I wanted to see if sensors installed on a product could be used to “review” it, just like customers do when they post an online review.
I looked at daily sensor data from 40,000 computer hard drives owned by Backblaze and compared their failure rate and life expectancy, by model, to the perceived failure rate and life expectancy obtained by scraping online reviews on Amazon and Newegg. Because the star rating of a review correlates approximately linearly with the perceived failure rate of a hard drive, I was able to map the sensor data to an expected star rating for each hard drive model. This expected star rating represents the overall rating that the hard drive would receive if it were rated by sensors rather than by human reviewers.
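The mapping described above can be sketched as an ordinary least-squares fit: regress review star rating on the failure rate perceived from reviews, then apply the fitted line to failure rates measured from sensor data. The numbers below are invented for illustration only; the actual project used the Backblaze and scraped-review data.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y ~ a * x + b for 1-D data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return a, my - a * mx

# Invented example: perceived annual failure rate (from reviews)
# alongside the average star rating of those reviews, per drive model.
review_failure_rate = [0.02, 0.05, 0.10, 0.20]
star_rating = [4.6, 4.2, 3.5, 2.1]

a, b = fit_line(review_failure_rate, star_rating)  # slope is negative

# Apply the fitted line to a failure rate measured from sensor data
# to get the "sensor star rating" for that hard drive model.
sensor_failure_rate = 0.08
predicted_stars = a * sensor_failure_rate + b
print(round(predicted_stars, 2))
```

The negative slope encodes the intuition that models with higher failure rates earn lower star ratings; the fitted line then translates a sensor-measured failure rate into the rating reviewers would most likely have given.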