At The Data Incubator we pride ourselves on having the latest data science curriculum. Much of our course material is based on feedback from corporate and government partners about the technologies they are looking to learn. However, we wanted to develop a more data-driven approach to what we teach in our data science corporate training and our free fellowship for data science masters and PhDs looking to begin their careers in the industry.
This report is the first in a series analyzing data-science-related topics. We thought it would be useful to the data science community to rank and analyze a variety of topics related to the profession in simple, easy-to-digest cheat sheets, rankings, or reports.
This first report ranks R packages for machine learning, and we’re hoping to stir the pot a bit and get our colleagues to join the discussion. Our discoveries here aren’t final, but rather serve to showcase the depth, and the breadth, of knowledge available to the data science community.
Like any good report, The Data Incubator Rankings will take into consideration a wide range of factors to showcase trends and provide a sense of clarity as to the programming biases and preferred methods of the data science community. Disagree with our findings? Great, feel free to drop us a line; we love a good debate.
One of the questions we hear most often from the data scientists we train is “what is the best programming language for machine learning?”
The resulting discussion, depending on the day, ends in either a hotly contested debate between R, Python, and MATLAB fans or a full-on WWE wrestling match. Ultimately, the programming language of choice for machine learning comes down to three criteria:
- The type of problem the data scientist is working with
- Personal programming preferences
- The type of machine learning they’re looking to perform
In other words, it depends. However, R is the leading choice among data professionals who want to understand data using statistical methods and graphs. It has several machine learning packages and advanced implementations of the top machine learning algorithms. R is also open source.
This project began as a ranking of the top packages for all data scientists, but we soon found that the scope was too broad. Data scientists do many different things, and you can classify almost any R package as helping a data scientist.
For this ranking, The Data Incubator focused on a number of criteria, including an exhaustive list of ML packages and three objective metrics: total downloads, GitHub stars, and the number of Stack Overflow questions.
What are the most popular machine learning packages? We took a look at a ranking based on package downloads and social website activity. The ranking is based on the average rank of CRAN (The Comprehensive R Archive Network) downloads and Stack Overflow activity. CRAN downloads are from the past year.
The Stack Overflow rank is based on the number of results with the package name in a question body, along with the tag ‘R’. The GitHub ranking is based on the number of stars for the repository. See below for methodological details.
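To make the "average rank" idea concrete, here is a minimal sketch in Python. It ranks packages separately on each metric (1 = highest count) and then averages those ranks, so a lower average means a more popular package. The package names and counts are made-up placeholders, not our data, and the helper names (`rank_descending`, `average_rank`), the tie handling, and the choice of which metrics go into the average are all assumptions for illustration rather than the exact code behind the ranking.

```python
from statistics import mean


def rank_descending(values):
    """Map each key to its rank (1 = largest value); ties share the average rank."""
    ordered = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover every item tied with the item at position i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = shared_rank
        i = j + 1
    return ranks


def average_rank(metrics):
    """Average the per-metric ranks; return (package, score) pairs, best first."""
    per_metric = [rank_descending(counts) for counts in metrics.values()]
    # Only score packages that have a value for every metric.
    packages = set.intersection(*(set(r) for r in per_metric))
    overall = {p: mean(r[p] for r in per_metric) for p in packages}
    return sorted(overall.items(), key=lambda kv: (kv[1], kv[0]))


# Placeholder counts for three hypothetical packages (not real measurements).
metrics = {
    "cran_downloads": {"pkg_a": 900, "pkg_b": 500, "pkg_c": 100},
    "stackoverflow_questions": {"pkg_a": 40, "pkg_b": 80, "pkg_c": 10},
    "github_stars": {"pkg_a": 300, "pkg_b": 200, "pkg_c": 250},
}

ranking = average_rank(metrics)
for package, score in ranking:
    print(f"{package}: average rank {score:.2f}")
```

Averaging ranks rather than raw counts keeps one very large number, such as yearly downloads, from drowning out the smaller Stack Overflow and GitHub figures.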
Conclusion: Top Packages For Data Science?
This project started as a ranking of the top packages for data science, but we soon found that the scope was too broad. Data scientists do many different things, and you can classify almost any R package as helping a data scientist. Should we include string manipulation packages? How about packages to read data from databases? A longer project, for another day, could be to use even more “Data Science” to come up with a ranking of the top R packages for doing “Data Science.”
Source code is available on The Data Incubator’s GitHub: https://github.com/thedataincubator/data-science-blogs/
If you’re interested in learning more, consider taking a look at the following: