Paul was a Fellow in our Fall 2016 cohort who landed a job with Cloudera.
Tell us about your background. How did it set you up to be a great Data Scientist?
Following the completion of my PhD in Electrical and Computer Engineering in 2009, I joined Palantir Technologies as a Forward Deployed Engineer (client-facing software engineer). There, I helped Palantir enter a new vertical, that of Fortune 500 companies, where I built data integration and analysis software for novel commercial workflows. I left Palantir in 2012 and in 2013 I co-founded SolveBio, a genomics company whose mission is to help improve the variant-curation process: the process by which clinicians and genetic counselors research genetic mutations and label them as pathogenic, benign, or unknown. At SolveBio, my work was primarily focused on building scalable data cleansing, transformation, and ingestion infrastructure that could be used to power the SolveBio genomics API. I also worked closely with geneticists and other domain experts in a semi-client-facing role.
The theme of my six years as a software engineer has been to help domain experts, whether they be fraud investigators at a bank or clinicians at a hospital, analyze disparate data to make better decisions. I have built infrastructure in both Java and Python, have used large SQL and NoSQL databases, and have spent countless hours perfecting Bash hackery (or wizardry, depending on your perspective).
My experiences as a software engineer were very relevant to data science in that I learned many ways to access, manipulate, and understand a variety of datasets from a variety of sources in a variety of formats. As the adage goes, “Garbage in, garbage out.” Nowhere is this more true than in data science. Performing good data science requires cleaning and organizing data, and I feel very comfortable with this process.
What do you think you got out of The Data Incubator?
What advice would you give to someone who is applying for The Data Incubator, particularly someone with your background?
I would advise that incoming candidates have at least an elementary-level understanding of basic statistics (p-values, linear regression, etc.) and Python.
What is your favorite thing you learned at The Data Incubator?
I loved learning Pandas (seriously) and scikit-learn. Both of these frameworks are very well architected and implemented, and can make your life as a data scientist substantially more productive.
Could you tell us about your Data Incubator Capstone project?
My Capstone Project involves using machine learning models to build market-neutral long/short trading strategies from quantitative financial data. The data I used was primarily sourced from CapIQ and was collected via a combination of web scraping and bulk downloading. I calculated year-over-year rank-ordering of more than 20 financial indicators and used them to train classifiers whose output was ‘long’ or ‘short’ (i.e., positive growth vs. negative growth). I observed good portfolio performance which, as expected, improved with the complexity of my models. My eventual goal is to integrate categorical datasets, such as sentiment extracted from filings and fraud indicators extracted from SEC documents, into my analysis and to backtest strategies more comprehensively.
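Editor's note: for readers who want a concrete picture of the data preparation described above, here is a minimal sketch. It is not Paul's code. It reads "year-over-year rank-ordering" as one possible interpretation: compute each indicator's year-over-year change per company, then rank companies against each other within the same year. All tickers, indicator names, figures, and function names are hypothetical. The classifier itself is left out, so the labels come straight from a made-up return table, and the equal-weight, dollar-neutral portfolio at the end is only a simple stand-in for market neutrality.

```python
from collections import defaultdict

# Hypothetical toy data: (ticker, year) -> indicator values. Not real figures.
RAW = {
    ("AAA", 2014): {"revenue": 100.0, "margin": 0.10},
    ("AAA", 2015): {"revenue": 120.0, "margin": 0.12},
    ("BBB", 2014): {"revenue": 200.0, "margin": 0.20},
    ("BBB", 2015): {"revenue": 190.0, "margin": 0.25},
    ("CCC", 2014): {"revenue": 50.0, "margin": 0.05},
    ("CCC", 2015): {"revenue": 55.0, "margin": 0.04},
}

# Hypothetical growth observed after 2015, used only to build labels.
NEXT_YEAR_RETURN = {"AAA": 0.08, "BBB": -0.03, "CCC": 0.01}


def yoy_changes(raw):
    """Fractional year-over-year change of each indicator, per (ticker, year)."""
    out = {}
    for (ticker, year), values in raw.items():
        prev = raw.get((ticker, year - 1))
        if prev is None:
            continue  # no prior year to compare against
        out[(ticker, year)] = {
            name: (value - prev[name]) / abs(prev[name])
            for name, value in values.items()
        }
    return out


def cross_sectional_ranks(changes):
    """Rank companies within each year on each indicator, scaled to [0, 1]."""
    by_year = defaultdict(list)
    for ticker, year in changes:
        by_year[year].append(ticker)

    ranks = {}
    for year, tickers in by_year.items():
        n = len(tickers)
        for name in changes[(tickers[0], year)]:
            ordered = sorted(tickers, key=lambda t: changes[(t, year)][name])
            for position, ticker in enumerate(ordered):
                scaled = position / (n - 1) if n > 1 else 0.5
                ranks.setdefault((ticker, year), {})[name] = scaled
    return ranks


def label(growth):
    """Binary target: positive growth -> 'long', otherwise 'short'."""
    return "long" if growth > 0 else "short"


def neutral_weights(labels):
    """Equal weights per side: longs sum to +1, shorts sum to -1."""
    longs = [t for t, side in labels.items() if side == "long"]
    shorts = [t for t, side in labels.items() if side == "short"]
    weights = {t: 1.0 / len(longs) for t in longs}
    weights.update({t: -1.0 / len(shorts) for t in shorts})
    return weights


ranks = cross_sectional_ranks(yoy_changes(RAW))
labels = {ticker: label(r) for ticker, r in NEXT_YEAR_RETURN.items()}
weights = neutral_weights(labels)

for key in sorted(ranks):
    print(key, ranks[key], labels[key[0]])
print(weights)
```

In a full pipeline, the rank dictionaries would become feature rows and the long/short labels the training targets for a classifier. The portfolio weights would then be built from the classifier's predictions instead of from known returns.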