Natural Language Processing in Python: NLTK vs. spaCy

The venerable NLTK has long been the standard tool for natural language processing in Python. It contains an amazing variety of tools, algorithms, and corpora. Recently, a competitor has arisen in the form of spaCy, which has the goal of providing powerful, streamlined language processing. Let’s see how these toolkits compare.

Philosophy

NLTK provides a number of algorithms to choose from. For a researcher, this is a great boon. Its nine different stemming libraries, for example, allow you to finely customize your model. For the developer who just wants a stemmer to use as part of a larger project, this tends to be a hindrance. Which algorithm performs the best? Which is the fastest? Which is being maintained? In contrast, spaCy implements a single algorithm for each task, the one that the spaCy developers feel to be best (for reducing words to a base form, that is a lemmatizer rather than a stemmer). They promise to keep it updated, and may replace it with an improved algorithm as the state of the art progresses. You may update your version of spaCy and find that improvements to the library have boosted your application without any work on your part. (The downside is that you may need to rewrite some test cases.)

As a quick glance through the NLTK documentation demonstrates, different languages may need different algorithms. NLTK lets you mix and match the algorithms you need, but spaCy has to make a choice for each language. This is a long process, and at the time of writing spaCy supports only English.
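To make the "which stemmer?" question concrete, here is a minimal sketch that runs the same words through three of the stemmers that ship with NLTK. The helper name stem_with_all and the word list are made up for illustration. The stemmer classes themselves are part of nltk.stem and need no corpus download.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Three of NLTK's stemmers. Which one is "right" depends on the project,
# which is exactly the decision NLTK leaves to the developer.
STEMMERS = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}


def stem_with_all(words):
    """Return {stemmer_name: [stem, ...]} for the given words.

    Hypothetical helper for this sketch, not part of NLTK.
    """
    return {
        name: [stemmer.stem(word) for word in words]
        for name, stemmer in STEMMERS.items()
    }


if __name__ == "__main__":
    sample_words = ["running", "generously", "maximum"]
    for name, stems in stem_with_all(sample_words).items():
        print(f"{name:10s} {stems}")
```

Running it prints one line per stemmer, so any disagreements between the algorithms are easy to spot side by side.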

Strings versus objects

NLTK is essentially a string processing library. All the tools take strings as input and return strings or lists of strings as output. This is simple to deal with at first, but it requires the user to explore the documentation to discover the functions they need. In contrast, spaCy uses an object-oriented approach. Parsing some text returns a document object, whose words and sentences are represented by objects themselves. Each of these objects has a number of useful attributes and methods, which can be discovered through introspection. This object-oriented approach lends itself much better to modern Python style than does the string-handling system of NLTK. A more detailed comparison between these approaches is available in this notebook.
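The difference can be sketched in a few lines. The example below is an illustration rather than the contents of the notebook mentioned above. It assumes spaCy v3 (for the add_pipe("sentencizer") call style) and uses spacy.blank("en"), a tokenizer-only English pipeline, so that no trained model has to be downloaded. The sample text and the describe helper are made up for the example.

```python
import spacy
from nltk.tokenize import wordpunct_tokenize
from spacy.tokens import Token

TEXT = "NLTK returns strings. spaCy returns objects!"

# NLTK: the result is a plain list of str. Anything else you want to know
# about a token means finding and calling another function.
nltk_tokens = wordpunct_tokenize(TEXT)

# spaCy: the result is a Doc made of Token objects.
nlp = spacy.blank("en")      # tokenizer only, no trained components
nlp.add_pipe("sentencizer")  # rule-based sentence splitter (spaCy v3 style)
doc = nlp(TEXT)


def describe(token: Token) -> dict:
    """Collect a few attributes that every spaCy Token carries.

    Hypothetical helper for this sketch.
    """
    return {
        "text": token.text,
        "char_offset": token.idx,
        "is_alpha": token.is_alpha,
        "is_punct": token.is_punct,
    }


if __name__ == "__main__":
    print(nltk_tokens)
    for token in doc:
        print(describe(token))
    for sentence in doc.sents:
        print(repr(sentence.text))
```

Calling dir(doc[0]) in an interactive session lists the rest of the attributes, which is the kind of introspection the object-oriented design makes possible.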

Performance

An important part of a production-ready library is its performance, and spaCy brags that it’s ready to be used. We’ll run some tests on the text of the Wikipedia article on NLP, which contains about 10 kB of text. The tests will be word tokenization (splitting a document into words), sentence tokenization (splitting a document into sentences), and part-of-speech tagging (labeling the grammatical function of each word).

[Figure: timing comparison of NLTK and spaCy on the three tasks]

It is fairly obvious that spaCy dramatically outperforms NLTK in word tokenization and part-of-speech tagging. Its poor performance in sentence tokenization is a result of differing approaches: NLTK simply attempts to split the text into sentences. In contrast, spaCy is actually constructing a syntactic tree for each sentence, a more robust method that yields much more information about the text. (You can see a visualization of the result here.)
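A comparison of this kind can be set up with a small timing harness built on the standard library's timeit module. The sketch below is an assumption about how such a measurement might look, not the code behind the results discussed here. The helpers best_time and compare are hypothetical, and the two tokenizers are simple stand-ins so that the block runs without any NLP library installed. To time the real libraries, pass callables such as nltk.word_tokenize or a loaded spaCy pipeline in their place.

```python
import re
import timeit


def best_time(func, text, repeat=5, number=10):
    """Best average seconds per call of func(text) over `repeat` rounds."""
    timer = timeit.Timer(lambda: func(text))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def compare(candidates, text):
    """Time every callable in `candidates` ({name: func}) on the same text."""
    return {name: best_time(func, text) for name, func in candidates.items()}


# Stand-in tokenizers so the sketch is self-contained.
def whitespace_tokenize(text):
    return text.split()


def regex_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)


if __name__ == "__main__":
    # Roughly 10 kB of text, similar in size to the article used above.
    sample = "Natural language processing is fun. " * 300
    timings = compare(
        {"whitespace": whitespace_tokenize, "regex": regex_tokenize},
        sample,
    )
    for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
        print(f"{name:12s} {seconds * 1e3:.3f} ms per call")
```

Taking the minimum over several rounds, rather than the mean, reduces the influence of unrelated activity on the machine while the test runs.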

Conclusion

While NLTK is certainly capable, I feel that spaCy is a better choice for most common uses. It makes the hard choices about algorithms for you, providing state-of-the-art solutions. Its Pythonic API will fit in well with modern Python programming practices, and its fast performance will be much appreciated. Unfortunately, spaCy is English-only at the moment, so developers concerned with other languages will need to use NLTK. Developers who need to ensure a particular algorithm is being used will also want to stick with NLTK. Everyone else should take a look at spaCy.
