The Science of Data Science: Alumni Spotlight on Paul George

Paul was a Fellow in our Fall 2016 cohort who landed a job with Cloudera.

Tell us about your background. How did it set you up to be a great Data Scientist?

Following the completion of my PhD in Electrical and Computer Engineering in 2009, I joined Palantir Technologies as a Forward Deployed Engineer (client-facing software engineer). There, I helped Palantir enter a new vertical, that of Fortune 500 companies, where I built data integration and analysis software for novel commercial workflows. I left Palantir in 2012, and in 2013 I co-founded SolveBio, a genomics company whose mission is to help improve the variant-curation process: the process by which clinicians and genetic counselors research genetic mutations and label them as pathogenic, benign, or unknown. At SolveBio, my work was primarily focused on building scalable data cleansing, transformation, and ingestion infrastructure that could be used to power the SolveBio genomics API. I also worked closely with geneticists and other domain experts in a semi-client-facing role.

The theme of my six years as a software engineer has been to help domain experts, whether they be fraud investigators at a bank or clinicians at a hospital, analyze disparate data to make better decisions. I have built infrastructure in both Java and Python, have used large SQL and NoSQL databases, and have spent countless hours perfecting Bash hackery (or wizardry, depending on your perspective).

My experiences as a software engineer were very relevant to data science in that I learned many ways to access, manipulate, and understand a variety of datasets from a variety of sources in a variety of formats. As the adage goes, “Garbage in. Garbage out.” Nowhere is this more true than in data science. Performing good data science requires cleaning and organizing data, and I feel very comfortable with this process.

What do you think you got out of The Data Incubator?

Several things! First, I got much-needed exposure to the “science” portion of data science. I learned the techniques, terminology, and mathematics that every data scientist should know. Moreover, The Data Incubator is organized so as to promote collaboration and (lots of) discussion amongst fellows. Not only did this help with my understanding, but I also built several lasting personal relationships with other fellows.

What advice would you give to someone who is applying for The Data Incubator, particularly someone with your background?

I would advise that incoming candidates have at least an elementary-level understanding of basic statistics (p-values, linear regression, etc.) and Python.

What is your favorite thing you learned at The Data Incubator?

I loved learning Pandas (seriously) and scikit-learn. Both of these frameworks are very well architected and implemented, and can make you substantially more productive as a data scientist.

Could you tell us about your Data Incubator Capstone project?

My Capstone Project involves using machine learning models to build market-neutral long/short trading strategies from quantitative financial data. The data I used was primarily sourced from CapIQ and was collected via a combination of web scraping and bulk downloading. I calculated year-over-year rank-ordering of more than 20 financial indicators and used them to train classifiers whose output was ‘long’ or ‘short’ (i.e. positive growth vs. negative growth). I observed good portfolio performance which, as expected, improved with the complexity of my models. My eventual goal is to integrate categorical datasets, such as sentiment extracted from filings and fraud indicators extracted from SEC documents, into my analysis and to backtest strategies more comprehensively.
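For readers curious what such a pipeline might look like, here is a minimal sketch using pandas and scikit-learn. It is not Paul's actual code: the data is randomly generated, the three indicator names and the tickers are made up, "year-over-year rank-ordering" is interpreted here as the change in each company's percentile rank between two years, and logistic regression stands in for whichever classifiers were really used. The equal-weight `market_neutral_weights` helper is likewise a hypothetical illustration of one way to make long and short exposure cancel out.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 200 hypothetical companies,
# three made-up indicators, two consecutive years.
n_companies = 200
tickers = [f"C{i:03d}" for i in range(n_companies)]
indicator_cols = ["revenue", "ebitda_margin", "debt_to_equity"]


def random_year():
    """Generate one year of fake indicator values, indexed by ticker."""
    values = rng.normal(size=(n_companies, len(indicator_cols)))
    return pd.DataFrame(values, index=tickers, columns=indicator_cols)


prev_year, this_year = random_year(), random_year()

# Rank every company against its peers within each year (percentile rank),
# then take the year-over-year change in rank as the feature.
features = this_year.rank(pct=True) - prev_year.rank(pct=True)

# Fake target (assumption): a noisy forward return, labeled 'long' when
# positive and 'short' when negative.
forward_return = (
    0.5 * features["revenue"]
    - 0.3 * features["debt_to_equity"]
    + rng.normal(scale=0.2, size=n_companies)
)
labels = np.where(forward_return > 0, "long", "short")

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
predictions = model.predict(X_test)


def market_neutral_weights(predictions):
    """Equal-weight longs and shorts so that net exposure sums to zero."""
    is_long = np.asarray(predictions) == "long"
    n_long, n_short = is_long.sum(), (~is_long).sum()
    if n_long == 0 or n_short == 0:
        # Cannot balance a one-sided book; hold nothing.
        return np.zeros(len(is_long))
    return np.where(is_long, 1.0 / n_long, -1.0 / n_short)


weights = pd.Series(market_neutral_weights(predictions), index=X_test.index)
print("Held-out accuracy:", model.score(X_test, y_test))
print("Net exposure:", weights.sum())
```

Working with ranks rather than raw values keeps indicators with very different scales comparable, and balancing the long and short weights is what makes the resulting portfolio market-neutral in this simplified sense.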

And lastly, tell us about your new job!

I’ll be working for Cloudera as a Software Engineer. In particular, I’ll be joining a very small team that is focused on further reducing the friction of doing data science at scale and in the “cloud.” We will be providing the abstraction required for data scientists and engineers to quickly and easily deploy Hadoop clusters so they can focus on analytics instead of infrastructure.