Marc was a Fellow in our Spring 2016 cohort in San Francisco who’s now working at Google as a Computational Linguist.
Tell us about your background. How did it set you up to be a great data scientist?
I started life as a programmer, then I went back to graduate school at UC Berkeley for linguistics, where I actually didn’t use my programming skills for a while. From there I did a neuroscience postdoc, where I started to use my programming skills a bit more. The neuroscience endeavor is, in many ways, a big-data endeavor. You get a tremendous amount of data from doing neuroscience experiments, and figuring out how to interpret and make sense of that incredible amount of data requires techniques that are not common in the behavioral sciences but are quite typical of machine learning and related data science fields. The way I used those techniques as a scientist was not particularly sophisticated, and while there is a lot of data when you’re analyzing neuroscience, it’s still orders of magnitude less than what people typically think of as big data.
So, when I started looking for job opportunities outside of academia, I realized that the way I talked about data analysis and the techniques that I used were not up to date. I wasn’t using the latest methodologies, tools, and terminologies that data scientists used even though the basic concepts were much the same.
What do you think you got out of The Data Incubator?
The key thing was being able to talk about data science intelligently in a way I hadn’t before, during interviews. I was able to update my knowledge to where the field and industry currently are, which helped tremendously when talking with prospective employers. I also learned about ideas and concepts, reflecting the latest research in the field, that helped me become a better data scientist.
What advice would you give to someone who is applying for The Data Incubator, particularly someone with your background?
I suspect that there are a lot of very obvious projects that people propose over and over again for their capstone, including mine. So I would suggest that you bring the creativity you used to come up with your dissertation topic, building on what’s already out there while going into new territory, to the project that you propose for The Data Incubator. You don’t want to do anything too obvious, but also nothing so novel that you can’t build on other people’s work and actually get somewhere.
Could you tell us about your Data Incubator Capstone project?
My capstone project incorporated natural language processing, machine learning, and a public social media API. What I did was I looked at the sentiment of people’s tweets, looked at it geographically, then looked for a relationship between that and other features distributed geographically, including things like socio-economic status. I looked at tweeters in, say, Nevada, and saw that their tweets seemed to be really positive and their income seemed to be really high, while tweeters in Mississippi seemed to be kind of sad and their socio-economic status tended to be low. Those were the sort of relationships I was looking for. I was able to map people’s tweet happiness at the county level and then look at correlations with some of these other socio-economic variables and model that.
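The pipeline Marc describes could be sketched roughly as follows. This is not his actual code: the word lists, tweets, and income figures are all made-up stand-ins, and the real project used proper sentiment libraries rather than a toy lexicon.

```python
# Minimal sketch of the capstone idea (hypothetical data and lexicon):
# score tweet sentiment, aggregate by region, correlate with income.
from statistics import mean

POSITIVE = {"great", "happy", "love", "awesome"}
NEGATIVE = {"sad", "awful", "hate", "terrible"}

def tweet_score(text):
    """Crude lexicon score: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Hypothetical tweets tagged with a region, plus made-up median incomes.
tweets = [
    ("NV", "love this awesome show"),
    ("NV", "great day happy times"),
    ("MS", "sad news awful weather"),
    ("MS", "hate this terrible traffic"),
]
income = {"NV": 60000, "MS": 40000}

# Average sentiment per region.
by_region = {}
for region, text in tweets:
    by_region.setdefault(region, []).append(tweet_score(text))
avg_sentiment = {r: mean(scores) for r, scores in by_region.items()}

def pearson(xs, ys):
    """Pearson correlation, computed from scratch with the stdlib."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

regions = sorted(avg_sentiment)
r = pearson([avg_sentiment[x] for x in regions],
            [income[x] for x in regions])
print(avg_sentiment, r)
```

In practice the regional key would be a county derived from geotags, and the correlation step would use a statistics library rather than a hand-rolled function, but the shape of the analysis is the same.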
How did you come up with the idea for the project?
Honestly, I just wanted to do something involving language, machine learning, and some sort of public API, like a social media API. Those things just came together in this project, though I think Twitter sentiment is a bit of an overworked topic.
What technologies did you use and what skills did you learn at TDI that you applied to the project?
Definitely one of the big skills I learned was how to put together a project from end to end, from the back end to the front end, and to stand up an application from scratch, which was very new to me. The technologies I used were the Twitter API, the NLTK and spaCy libraries for sentiment analysis, a few other libraries that people had made available for additional sentiment analysis, and Bootstrap and Flask for the front end.
What was your most surprising or interesting finding?
I guess maybe this shouldn’t be surprising, but at the time it was a challenge, and what I found interesting was how non-uniformly distributed people’s tweets are around the country, even in places you’d think they would be evenly distributed. So, if you’re looking at, say, the Boston area, you find tweets over a one-week span all over the place, but then, for reasons I don’t totally understand, you’ll have certain zipcodes that just have no tweets at all, like Chinatown. Maybe that makes sense, in retrospect, but Boston’s a very dense area, so you can imagine what it’s like in the rest of the country, where you might have no people tweeting in the entire western half of South Dakota for a week. It’s just a big empty space on your map and you have to figure out how to approach that. It’s a canonical missing-data problem, but one I didn’t expect to have.
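One simple way to handle the empty counties Marc mentions is to impute a fallback value for regions with no tweets. The county names and sentiment scores below are illustrative, and mean imputation is just one possible choice, not necessarily what the project did.

```python
# Sketch of the missing-data problem: some counties have no tweets,
# so their sentiment is undefined. Here we fall back to the average
# of the counties that do have data (values are hypothetical).
from statistics import mean

county_sentiment = {
    "Suffolk": 0.4,       # plenty of tweets
    "Middlesex": 0.1,
    "Pennington": None,   # western South Dakota: no tweets this week
}

observed = [s for s in county_sentiment.values() if s is not None]
fallback = mean(observed)
filled = {county: (s if s is not None else fallback)
          for county, s in county_sentiment.items()}
print(filled)
```

Other options include leaving the county blank on the map or borrowing from neighboring counties; the right choice depends on whether the absence of tweets is itself informative.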
Describe the business application for this project (how could a company use your work or your data?).
The idea was that it would be pretty trivial to add additional query filtering on people’s tweets, so that you could add brand names of products and ask what people’s sentiment in North Dakota is towards Cheerios, Hillary Clinton, the Boston Red Sox, and so on. That is, you can juxtapose the geographic distribution of sentiment with anything else to know the geographic distribution of sentiment for that thing. I only had time to look at the overall distribution of sentiment, but you could slice things a bit more to allow for consumer sentiment research or market research.
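The filtering extension Marc describes might look something like this: restrict the tweet stream to those mentioning a query term before computing regional sentiment. The function name and sample tweets are hypothetical, not from the project.

```python
# Hypothetical sketch of brand-level filtering: keep only tweets that
# mention a query term, then feed the result into the same regional
# sentiment pipeline used for the overall map.
tweets = [
    ("ND", "love my cheerios this morning"),
    ("ND", "terrible traffic today"),
    ("MA", "red sox were awesome tonight"),
]

def filter_by_term(tweets, term):
    """Keep only (region, text) pairs whose text mentions the term."""
    term = term.lower()
    return [(region, text) for region, text in tweets
            if term in text.lower()]

cheerios = filter_by_term(tweets, "cheerios")
print(cheerios)
```

In a live system the same effect would come from a keyword filter on the Twitter search or streaming endpoint, so the filtering happens before the tweets ever reach the application.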
And lastly, tell us about your new job!
Now I’m working at Google, after doing a year at an ad-tech company coming out of the Fellowship. I’m helping to build out capabilities for Google Assistant, so that it becomes more and more like speaking to a person and covers more of the capabilities that people might actually want from their digital assistants.