Python Multi-Threading vs. MultiProcessing

There is a library called threading in Python and it uses threads (rather than just processes) to implement parallelism. This may be surprising news if you know about the Python’s Global Interpreter Lock, or GIL, but it actually works well for certain instances without violating the GIL. And this is all done without any overhead — simply define functions that make I/O requests and the system will handle the rest.

Global Interpreter Lock & Python Threading

The Global Interpreter Lock reduces the usefulness of threads in Python (more precisely CPython) by allowing only one native thread to execute at a time. This made implementing Python easier to implement in the (usually thread-unsafe) C libraries and can increase the execution speed of single-threaded programs. However, it remains controvertial because it prevents true lightweight parallelism. You can achieve parallelism, but it requires using multi-processing, which is implemented by the eponymous library multiprocessing. Instead of spinning up threads, this library uses processes, which bypasses the GIL.

It may appear that the GIL would kill Python multithreading but not quite. In general, there are two main use cases for multithreading:

  1. To take advantage of multiple cores on a single machine
  2. To take advantage of I/O latency to process other threads

In general, we cannot benefit from (1) with threading but we can benefit from (2).

MultiProcessing in Python

The general threading library is fairly low-level but it turns out that multiprocessing wraps this in multiprocessing.pool.ThreadPool, which conveniently takes on the same interface as multiprocessing.pool.Pool.

One benefit of using threading is that it avoids pickling. Multi-processing relies on pickling objects in memory to send to other processes. For example, if the timed decorator did not wraps the wrapper function it returned, then CPython would be able to pickle our functions request_func and selenium_func and hence these could not be multi-processed. In contrast, the threading library, even through multiprocessing.pool.ThreadPool works just fine. Multiprocessing also requires more ram and startup overhead.

Analysis

We analyze the highly I/O dependent task of making 100 URL requests for random wikipedia pages. We compare:

  1. The Python requests module and
  2. The Python selenium with PhantomJS.

We run each of these requests in three ways and measure the time required for each fetch:

  1. In serial
  2. In parallel in a threading pool with 10 threads
  3. In parallel in a multiprocessing pool with 10 threads

Each request is timed and we compare the results.

Results

Firstly, the per-thread running time for requests is obviously lower than for selenium, as the latter requires spinning up a new process to run a PhantomJS headless browser. It’s also interesting to notice that the individual threads (particularly selenium threads) run faster in serial than in parallel, which is the typical bandwidth vs. latency tradeoff.

In particular, selenium threads are more than twice as slow, problably because of resource contention with 10 selenium processes spinning up at once.

Likewise, all threads run roughly 4 times faster for selenium requests and roughly 8 times faster for requests requests when multithreaded compared with serial.

Conclusion

There was no significant performance difference between using threading vs multiprocessing. The performance between multithreading and multiprocessing are extremely similar and the exact performance details are likely to depend on your specific application. Threading through multiprocessing.pool.ThreadPool is really easy, or at least as easy as using the multiprocessing.pool.Pool interface — simply define your I/O workloads as a function and use the ThreadPool to run them in parallel.

Improvements welcome! Please submit a pull request to our github.

Related Blog Posts

Moving From Mechanical Engineering to Data Science

Moving From Mechanical Engineering to Data Science

Mechanical engineering and data science may appear vastly different on the surface. Mechanical engineers create physical machines, while data scientists deal with abstract concepts like algorithms and machine learning. Nonetheless, transitioning from mechanical engineering to data science is a feasible path, as explained in this blog.

Read More »
Data Engineering Project

What Does a Data Engineering Project Look Like?

It’s time to talk about the different data engineering projects you might work on as you enter the exciting world of data. You can add these projects to your portfolio and show the best ones to future employers. Remember, the world’s most successful engineers all started where you are now.

Read More »
open ai

AI Prompt Examples for Data Scientists to Use in 2023

Artificial intelligence (AI) isn’t going to steal your data scientist job! Instead, AI tools like ChatGPT can automate some of the more mundane tasks in your future career, saving you time and energy. To make life easier, here are some data science prompts to get you started.

Read More »