At The Data Incubator, we pride ourselves on having the most up-to-date data science curriculum available. Much of our curriculum is based on feedback from corporate and government partners about the technologies they are using and learning. In addition to their feedback, we wanted to develop a data-driven approach for determining what we should be teaching in our fellowship for master’s and PhD graduates looking to enter data science careers in industry. Here are the results.
This technical article was written for The Data Incubator by Brett Sutton, a Fellow of our 2017 Summer cohort in Washington, DC.
When you’re getting started on a project that requires doing some heavy stats and machine learning in Python, there are a handful of tools and packages available. Two popular options are scikit-learn and StatsModels. In this post, we’ll take a look at each one and get an understanding of what each has to offer.
Development of scikit-learn began in 2007, and the package was first released in 2010. The current version, 0.19, came out in July 2017. StatsModels started in 2009, with the latest version, 0.8.0, released in February 2017. Though they are similar in age, scikit-learn is more widely used and more actively developed, as a quick look at each package on GitHub makes clear. Both packages have an active development community, though scikit-learn attracts a lot more attention, as shown below.
Much more is going on with scikit-learn across all these activity metrics. Each project has also attracted a fair amount of attention from other GitHub users who are not working on the packages themselves but are using them and keeping an eye out for changes, with lots of coders watching, starring, and forking each package. A quick search of Stack Overflow shows about ten times more questions about scikit-learn than about StatsModels (~21,000 compared to ~2,100), but there is still pretty robust discussion for each.
Checking out the GitHub repositories labelled with scikit-learn and StatsModels, we can also get a sense of the types of projects people are using each one for. Both sets are frequently tagged with python, statistics, and data-analysis, so it is no surprise that they’re both so popular with data scientists. The differences between them highlight what each in particular has to offer: scikit-learn’s other popular topics are machine-learning and data-science; StatsModels’ are econometrics, generalized-linear-models, timeseries-analysis, and regression-models. These topic tags reflect the conventional wisdom that scikit-learn is for machine learning and StatsModels is for complex statistics.
The topic differences reflect a division between the machine learning and statistics communities that’s been the source of a lot of discussion in forums like Quora, Stack Exchange, and elsewhere. Statisticians in years past may have argued that machine learning people didn’t understand the math that made their models work, while the machine learning people themselves might have said you can’t argue with results! Today, the fields have more and more in common, and a good head for statistics is crucial for doing good machine learning work, but the two tools do reflect this divide to some extent.
Scikit-learn offers a lot of simple, easy-to-learn algorithms that pretty much only require your data to be organized in the right way before you can run whatever classification, regression, or clustering algorithm you need. The pipelines provided in the package make the process of transforming your data easier, too. Of course, choosing between a Random Forest and a Ridge regression still requires understanding the difference between the two models, but scikit-learn has a variety of tools to help you pick the right models and variables. With a little bit of work, a novice data scientist could have a set of predictions in minutes. At The Data Incubator, students gain hands-on experience with scikit-learn, using the package for image analysis, catching Pokemon, flight analysis, and more.
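To make that workflow concrete, here is a minimal sketch of a scikit-learn pipeline and a quick cross-validated comparison of a Ridge regression against a Random Forest. The synthetic data, the variable names, and the hyperparameter values are all assumptions made up for illustration; they are not taken from any project mentioned in this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 200 rows, 3 features,
# with a linear signal plus a little Gaussian noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A pipeline chains the transformation (scaling) and the model into one
# estimator, so fit() and predict() apply both steps consistently.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)
predictions = model.predict(X_test)
r2 = model.score(X_test, y_test)
print(f"Ridge R^2 on held-out data: {r2:.3f}")

# Cross-validation is one of the tools for comparing candidate models.
candidates = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
scores = {
    name: cross_val_score(estimator, X, y, cv=5).mean()
    for name, estimator in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: mean cross-validated R^2 = {score:.3f}")
```

Because the fake data here is linear by construction, the Ridge pipeline should score well; on real data the comparison could easily go the other way, which is exactly why it helps to understand what each model assumes.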
Though StatsModels doesn’t have this variety of options, it offers statistics and econometric tools that are top of the line and validated against other statistics software like Stata and R. When you need a variety of linear regression models, mixed linear models, regression with discrete dependent variables, and more, StatsModels has options. It also has a syntax much closer to R’s, so for those who are transitioning from R to Python, StatsModels is a good choice. As expected for something coming from the statistics world, there’s an emphasis on understanding the relevant variables and effect sizes, rather than just finding the model with the best fit.
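As a rough illustration of that R-like syntax, the sketch below fits an ordinary least squares model with the StatsModels formula interface. The data frame, its column names (`x1`, `x2`, `y`), and the coefficients used to generate the fake data are assumptions invented for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration): y depends linearly
# on x1 and x2, plus Gaussian noise.
rng = np.random.RandomState(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.5, size=200)

# The formula string mirrors R's lm(y ~ x1 + x2, data = df);
# an intercept is included by default.
results = smf.ols("y ~ x1 + x2", data=df).fit()

# The summary table reports coefficients, standard errors, p-values,
# and confidence intervals: the focus is on the variables, not only the fit.
print(results.summary())
print(results.params)
print(results.conf_int())
```

Note how the output centers on each coefficient and its uncertainty, whereas the typical scikit-learn workflow centers on predictions and a score.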
Both scikit-learn and StatsModels give data scientists the ability to quickly and easily run models and get results fast, but good engineering skills and a solid background in the fundamentals of statistics are required. Finding the answers to tough machine learning questions is crucial, but it’s equally important to be able to clearly communicate, to a variety of stakeholders from a range of backgrounds, how and why the models work. For this reason, The Data Incubator emphasizes not just applying the models but talking about the theory that makes them work.