The Science of Data Science: Alumni Spotlight on Paul George

Paul was a Fellow in our Fall 2016 cohort who landed a job with Cloudera.

Tell us about your background. How did it set you up to be a great Data Scientist?

Following the completion of my PhD in Electrical and Computer Engineering in 2009, I joined Palantir Technologies as a Forward Deployed Engineer (client-facing software engineer). There, I helped Palantir enter a new vertical, that of Fortune 500 companies, where I built data integration and analysis software for novel commercial workflows. I left Palantir in 2012 and in 2013 I co-founded SolveBio, a genomics company whose mission is to help improve the variant-curation process: the process by which clinicians and genetic counselors research genetic mutations and label them as pathogenic, benign, or unknown. At SolveBio, my work was primarily focused on building scalable data cleansing, transformation, and ingestion infrastructure that could be used to power the SolveBio genomics API. I also worked closely with geneticists and other domain experts in a semi-client-facing role.

The theme of my six years as a software engineer has been to help domain experts, whether they be fraud investigators at a bank or clinicians at a hospital, analyze disparate data to make better decisions. I have built infrastructure in both Java and Python, have used large SQL and NoSQL databases, and have spent countless hours perfecting Bash hackery (or wizardry, depending on your perspective).

My experiences as a software engineer were very relevant to data science in that I learned many ways to access, manipulate, and understand a variety of datasets from a variety of sources in a variety of formats. As the adage goes, “Garbage in. Garbage out.” Nowhere is this more true than in data science. Performing good data science requires cleaning and organizing data, and I feel very comfortable with this process.

What do you think you got out of The Data Incubator?

Several things! First, I got much-needed exposure to the “science” portion of data science. I learned the techniques, terminology, and mathematics that every data scientist should know. Moreover, The Data Incubator is organized so as to promote collaboration and (lots of) discussion amongst fellows. Not only did this help with my understanding, but I also built several lasting personal relationships with other fellows.

What advice would you give to someone who is applying for The Data Incubator, particularly someone with your background?

I would advise that incoming candidates have at least an elementary-level understanding of basic statistics (p-values, linear regression, etc.) and Python.

What is your favorite thing you learned at The Data Incubator?

I loved learning Pandas (seriously) and scikit-learn. Both of these frameworks are very well architected and implemented, and can make you substantially more productive as a data scientist.

Could you tell us about your Data Incubator Capstone project?

My Capstone Project involves using machine learning models to build market-neutral long/short trading strategies from quantitative financial data. The data I used was primarily sourced from CapIQ and was collected via a combination of web scraping and bulk downloading. I calculated year-over-year rank-orderings of more than 20 financial indicators and used them to train classifiers whose output was ‘long’ or ‘short’ (i.e. positive growth vs. negative growth). I observed good portfolio performance which, as expected, improved with the complexity of my models. My eventual goal is to integrate categorical datasets, such as sentiment extracted from filings and fraud indicators extracted from SEC documents, into my analysis and to backtest strategies more comprehensively.
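To make the rank-ordering and long/short labeling described above more concrete, here is a minimal sketch in plain Python. It is not Paul's actual pipeline: the tickers, revenue figures, and function names are all hypothetical, and the classifier step is omitted, so the labels below stand in for what a trained model would predict.

```python
from typing import Dict, Tuple


def yoy_change(prev: float, curr: float) -> float:
    """Fractional year-over-year change; assumes prev is non-zero."""
    return (curr - prev) / abs(prev)


def percentile_ranks(values: Dict[str, float]) -> Dict[str, float]:
    """Rank companies against each other, scaled to [0, 1].

    Ties are broken by ticker name so the ordering is deterministic.
    """
    ordered = sorted(values, key=lambda ticker: (values[ticker], ticker))
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 0.5}
    return {ticker: i / (n - 1) for i, ticker in enumerate(ordered)}


def label(growth: float) -> str:
    """'long' for positive growth, 'short' otherwise."""
    return "long" if growth > 0 else "short"


def market_neutral_weights(labels: Dict[str, str]) -> Dict[str, float]:
    """Equal-weight the longs and the shorts so net exposure is zero."""
    longs = [t for t, side in labels.items() if side == "long"]
    shorts = [t for t, side in labels.items() if side == "short"]
    if not longs or not shorts:
        raise ValueError("need at least one long and one short position")
    weights = {t: 1.0 / len(longs) for t in longs}
    weights.update({t: -1.0 / len(shorts) for t in shorts})
    return weights


# Hypothetical data: (previous year, current year) revenue per ticker.
revenue: Dict[str, Tuple[float, float]] = {
    "AAA": (100.0, 120.0),
    "BBB": (200.0, 190.0),
    "CCC": (50.0, 55.0),
    "DDD": (80.0, 60.0),
}

growth = {t: yoy_change(prev, curr) for t, (prev, curr) in revenue.items()}
features = percentile_ranks(growth)             # one ranked indicator per company
labels = {t: label(g) for t, g in growth.items()}
weights = market_neutral_weights(labels)
```

In a real setting each company would carry one ranked feature per indicator, the labels would come from growth in a later period rather than the same one, and a classifier would be fit on the features to predict them.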

And lastly, tell us about your new job!

I’ll be working for Cloudera as a Software Engineer. In particular, I’ll be joining a very small team that is focused on further reducing the friction of doing data science at scale and in the “cloud.” We will be providing the abstraction required for data scientists and engineers to quickly and easily deploy Hadoop clusters so they can focus on analytics instead of infrastructure.
 
