At The Data Incubator we run a free advanced 8-week fellowship for PhDs looking to enter the industry as data scientists.
As part of the application process, we ask potential fellows to propose and begin working on a data science project to highlight their skills to employers. Regardless of whether you’re selected to be a fellow, this project will be instrumental in attracting employer interest and highlighting your skills. Here are some projects that we would love to see, and that we hope to see you take on as well.
Multi-Axial Political Analysis
We often think of American politics in terms of a single axis: left versus right, Democrat versus Republican. In reality, the parties are composed of factions with different identities and political priorities, and American politics is divided along multiple axes: foreign policy, social issues, regulation, social spending, education, and the Second Amendment, just to name a few.
Voters often do not completely agree with their party on all issues. While this is hard to discern from two-way state and national races, primary elections and ballot measures offer a multidimensional probe of voter sentiment. Use unsupervised learning to find clusters of precincts that tend to vote similarly and visualize them on a map. Are there patterns that defy the traditional partisan ones?
There’s plenty of data out there. For example, the states of Colorado and California maintain official state repositories of data. Projects like LEAP, CEDA, the Seattle Times, and Open Elections provide third-party data.
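As a starting point, the clustering step might be sketched as follows. This is a minimal k-means written with the standard library only, and it is just one of several unsupervised methods that could fit here. The precinct names, the ballot measures, and the vote shares are invented for illustration; real inputs would come from the repositories above.

```python
import random
from typing import Dict, List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vote-share vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_precincts(
    shares: Dict[str, Sequence[float]], k: int, iterations: int = 50, seed: int = 0
) -> Dict[str, int]:
    """Group precincts with k-means; returns precinct -> cluster index."""
    names = sorted(shares)
    points = [list(shares[name]) for name in names]
    rng = random.Random(seed)
    # Start from k distinct precincts chosen at random.
    centroids: List[List[float]] = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each precinct joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return dict(zip(names, labels))


if __name__ == "__main__":
    # Hypothetical "yes" shares on three ballot measures per precinct.
    example = {
        "precinct_a": [0.80, 0.20, 0.70],
        "precinct_b": [0.82, 0.18, 0.75],
        "precinct_c": [0.20, 0.90, 0.30],
        "precinct_d": [0.25, 0.85, 0.20],
    }
    print(cluster_precincts(example, k=2))
```

The resulting cluster index per precinct can then be joined to precinct boundaries to color a map. In practice you would likely try several values of k and compare the clusters against party registration to see where they diverge.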
Quantifying Academic Publication Prestige
When applying for a PhD, it’s hard to quantitatively determine which professor to select as a thesis advisor. And while US News and World Report provides graduate school rankings by field, it’s harder to determine which universities are the best for your subfield of interest.
While journals provide a proxy for rankings, knowledge of the top journals for fields (much less subfields) is extremely rarefied. However, SSRN, arXiv, PubMed, and Google Scholar provide extensive archives of papers and citation lists to help identify the influence of papers, authors, institutions, and journals.
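One way to turn citation lists into an influence score is a PageRank-style iteration over the citation graph. The sketch below is only an illustration of that idea, not a recommended metric: the paper IDs are invented, and the damping factor of 0.85 is the conventional default rather than anything tuned for citation data.

```python
from typing import Dict, List


def citation_pagerank(
    citations: Dict[str, List[str]], damping: float = 0.85, iterations: int = 100
) -> Dict[str, float]:
    """Score papers by PageRank, where `citations` maps a paper to the papers it cites."""
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    ordered = sorted(papers)
    n = len(ordered)
    rank = {paper: 1.0 / n for paper in ordered}
    for _ in range(iterations):
        # Papers that cite nothing spread their rank evenly over all papers.
        dangling = sum(rank[p] for p in ordered if not citations.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in ordered}
        for paper in ordered:
            cited = citations.get(paper, [])
            for target in cited:
                # Each citing paper splits its rank among the papers it cites.
                new_rank[target] += damping * rank[paper] / len(cited)
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical citation lists: paper -> papers it cites.
    graph = {
        "paper_a": ["paper_c"],
        "paper_b": ["paper_c", "paper_a"],
        "paper_c": [],
    }
    scores = citation_pagerank(graph)
    for paper, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(paper, round(score, 3))
```

Scores for authors, institutions, or journals could be built by summing the scores of their papers, though whether that aggregation is fair across subfields is exactly the kind of question the project would need to examine.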
Tracking Grants
The US government gives out billions in science grants each year through the NSF, the NIH, and DARPA, among others. You can track grants from proposal to paper. For example, the NSF releases all its grant data in XML, and grant recipients are required to publicly acknowledge their grant number in publications.
Other funding agencies have similar rules. How long does it take to write a paper and does that vary by discipline? How much does it cost to produce a paper? Are there certain fields or subfields that are more or less expensive? Can you use natural language processing to identify research projects that did not produce what they promised?
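Once grants have been matched to the papers that acknowledge them, the first two questions reduce to a small aggregation. The sketch below assumes that matching has already been done. The record fields (`grant_id`, `discipline`, `start`, `amount`, `published`) are hypothetical names chosen for illustration and are not the NSF's actual schema.

```python
from collections import defaultdict
from datetime import date
from statistics import median
from typing import Dict, List


def lag_and_cost_by_discipline(grants: List[dict], papers: List[dict]) -> Dict[str, dict]:
    """Median days from grant start to first paper, and dollars per paper, per discipline."""
    publication_dates = defaultdict(list)
    for paper in papers:
        publication_dates[paper["grant_id"]].append(paper["published"])

    lags = defaultdict(list)      # discipline -> days to first paper, one per grant
    spend = defaultdict(float)    # discipline -> total dollars awarded
    paper_count = defaultdict(int)
    for grant in grants:
        discipline = grant["discipline"]
        dates = publication_dates.get(grant["grant_id"], [])
        spend[discipline] += grant["amount"]
        paper_count[discipline] += len(dates)
        if dates:
            lags[discipline].append((min(dates) - grant["start"]).days)

    return {
        d: {
            "median_days_to_first_paper": median(lags[d]) if lags[d] else None,
            "dollars_per_paper": spend[d] / paper_count[d] if paper_count[d] else None,
        }
        for d in spend
    }


if __name__ == "__main__":
    # Invented records, for illustration only.
    grants = [
        {"grant_id": "G1", "discipline": "physics", "start": date(2020, 1, 1), "amount": 100_000},
        {"grant_id": "G2", "discipline": "biology", "start": date(2019, 6, 1), "amount": 250_000},
    ]
    papers = [
        {"grant_id": "G1", "published": date(2021, 3, 15)},
        {"grant_id": "G2", "published": date(2021, 1, 10)},
        {"grant_id": "G2", "published": date(2022, 2, 1)},
    ]
    print(lag_and_cost_by_discipline(grants, papers))
```

Dividing total award dollars by paper count is a crude cost measure, since grants also fund students, equipment, and work that never reaches publication. Treat the numbers as a first pass.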
Public Salaries
In the spirit of public transparency, states publish the compensation for all state employees (as a random sample, here are ones for Illinois, Rhode Island, California, Washington State). Which public sector professions or geographies pay better? How much do salaries increase over time? How does this vary by state? Can we correlate teacher pay with school ratings or hospital performance? Does university professor pay track with the prestige of their publication record?
Trending Technology Projects
Stack Overflow offers a comprehensive dump of all its data (available here). You can use this data to identify which users are experts at different subjects (top respondents), which topics get the most traction (plentiful questions and answers), and which topics are naturally harder (have responses only from highly experienced users). Can you use natural language processing to identify a good question (presumably one that receives answers, receives many upvotes, or isn’t marked as off topic) or a good answer (presumably one that receives many upvotes or is marked as “accepted”)?
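Finding top respondents per subject can be sketched as a simple aggregation of answer scores by tag. The example below works on a simplified, hypothetical record layout (`id`, `type`, `tags`, `parent_id`, `user`, `score`) and not on the dump's real format, so a loading step that maps the dump into these fields is assumed.

```python
from collections import defaultdict
from typing import List, Tuple


def top_answerers(posts: List[dict], tag: str, n: int = 3) -> List[Tuple[str, int]]:
    """Rank users by the total score of their answers to questions carrying `tag`."""
    # Map each question to its set of tags so answers can be attributed to topics.
    question_tags = {
        post["id"]: set(post["tags"]) for post in posts if post["type"] == "question"
    }
    totals = defaultdict(int)
    for post in posts:
        if post["type"] != "answer":
            continue
        if tag in question_tags.get(post["parent_id"], ()):
            totals[post["user"]] += post["score"]
    # Highest total first; ties are broken alphabetically by user name.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    # Invented posts, for illustration only.
    posts = [
        {"id": 1, "type": "question", "tags": ["python", "pandas"]},
        {"id": 2, "type": "question", "tags": ["java"]},
        {"id": 3, "type": "answer", "parent_id": 1, "user": "alice", "score": 5},
        {"id": 4, "type": "answer", "parent_id": 1, "user": "bob", "score": 2},
        {"id": 5, "type": "answer", "parent_id": 2, "user": "alice", "score": 10},
    ]
    print(top_answerers(posts, "python"))
```

Summed score favors prolific users. Dividing by answer count, or counting only accepted answers, would give a different and arguably stricter notion of expertise.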