Programmers tend to reach for Python because it is convenient and programmer-friendly, not because it is the fastest language available. The plethora of third-party libraries and the breadth of the language itself compensate for Python lacking the raw performance of Java or C. Speed of development takes precedence over speed of execution.
But a properly optimized Python application can run surprisingly fast, perhaps not Java-fast or C-fast, but fast enough for web applications, data analysis, management and automation tools, and most other purposes. You might even forget you were trading application performance for developer productivity.
Optimizing Python performance doesn't come down to any one factor. Rather, it's about applying best practices and choosing the right fit for the application at hand.
In this article I cover the optimizations most likely to boost the performance of your Python applications, such as swapping out the Python interpreter. Many of the biggest payoffs require more detailed work than the options listed below.
Move Math to NumPy
If you are doing matrix-based or array-based math and you don’t want the Python interpreter getting in the way, use NumPy. By drawing on C libraries for the heavy lifting, NumPy offers faster array processing than native Python. It also stores numerical data more efficiently than Python’s built-in data structures.
Relatively unexotic math can be sped up enormously by NumPy, too. The package provides replacements for many common Python math operations, like max, that run many times faster than the Python originals.
Another boon with NumPy is more efficient use of memory for large objects, such as lists with millions of items. On average, large objects like that in NumPy take up around one-fourth of the memory required if they were expressed in conventional Python. Note that it helps to begin with the right data structure for a job; choosing it is an optimization in itself.
Rewriting Python algorithms to use NumPy takes some work since array objects need to be declared using NumPy’s syntax. But NumPy uses Python’s existing idioms for actual math operations (+, -, and so on), so switching to NumPy isn’t too disorienting in the long run.
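As a small sketch of what that switch looks like (the function names here are illustrative, not from any particular library), the same computation can be written as an interpreted Python loop or as vectorized NumPy array operations that run in C:

```python
import numpy as np

# Pure-Python version: the interpreter executes one iteration at a time
def ssd_python(xs, ys):
    return sum((x - y) ** 2 for x, y in zip(xs, ys))

# NumPy version: the same sum of squared differences, but the subtraction,
# squaring, and summing each run as a single C-level array operation
def ssd_numpy(xs, ys):
    a = np.asarray(xs, dtype=np.float64)
    b = np.asarray(ys, dtype=np.float64)
    return float(((a - b) ** 2).sum())
```

Both functions return the same answer; the NumPy version pulls ahead as the inputs grow to thousands or millions of elements, since the per-element interpreter overhead disappears.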
Convert to Cython
If you want speed, use C, not Python. But for Pythonistas, writing C code brings a host of distractions—learning C’s syntax, wrangling the C toolchain (what’s wrong with my header files now?), and so on.
Cython allows Python users to conveniently access C’s speed. Existing Python code can be converted to C incrementally—first by compiling said code to C with Cython, then by adding type annotations for more speed.
Cython isn’t a magic wand. Code converted as-is to Cython doesn’t generally run more than 15 to 50 percent faster because most of the optimizations at that level focus on reducing the overhead of the Python interpreter. The biggest gains come when you provide type annotations for a Cython module, allowing the code in question to be converted to pure C. The resulting speedups can be orders-of-magnitude faster.
CPU-bound code benefits the most from Cython. If you’ve profiled (you have profiled, haven’t you?) and found that certain parts of your code use the vast majority of the CPU time, those are excellent candidates for Cython conversion. Code that is I/O bound, like long-running network operations, will see little or no benefit from Cython.
As with using C libraries, another important performance-enhancing tip is to keep the number of round trips to Cython to a minimum. Don’t write a loop that calls a “Cythonized” function repeatedly; implement the loop in Cython and pass the data all at once.
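A sketch of what a typed Cython module might look like (the module name and function are hypothetical; this is `.pyx` source compiled with `cythonize`, not plain Python). The `cdef` declarations let Cython translate the loop to pure C, and the loop itself lives inside the module so there is only one Python-to-C round trip per call:

```cython
# fastmath.pyx -- hypothetical example module
def sum_of_squares(int n):
    # Typed locals: Cython compiles this loop to C arithmetic
    cdef long total = 0
    cdef int i
    for i in range(n):
        total += i * i
    return total
```

From Python you would call `sum_of_squares(1_000_000)` once, rather than calling a small Cythonized helper a million times from a Python-side loop.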
Use Multiprocessing
Traditional Python apps—those implemented in CPython—execute only a single thread at a time, in order to avoid the problems of state that arise when using multiple threads. This is the infamous Global Interpreter Lock (GIL). The fact that there are good reasons for its existence doesn’t make it any less ornery.
The GIL has grown dramatically more efficient over time but the core issue remains. A CPython app can be multithreaded, but CPython doesn’t really allow those threads to run in parallel on multiple cores.
To get around that, Python provides the multiprocessing module to run multiple instances of the Python interpreter on separate cores. State can be shared by way of shared memory or server processes, and data can be passed between process instances via queues or pipes.
You still have to manage state manually between the processes. Plus, there’s no small amount of overhead involved in starting multiple instances of Python and passing objects among them. But for long-running processes that benefit from parallelism across cores, the multiprocessing library is useful.
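As a sketch of the pattern (the prime-counting workload is just an illustrative stand-in for CPU-bound work), a process pool can fan a job out across cores and gather the results back through the pool's internal queues:

```python
from multiprocessing import Pool

def count_primes(bounds):
    # CPU-bound work: count primes in [lo, hi] by trial division
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi + 1):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    # Split the range into chunks, one per worker process
    chunks = [(1, 25_000), (25_001, 50_000),
              (50_001, 75_000), (75_001, 100_000)]
    with Pool(processes=4) as pool:
        # map() pickles each chunk, ships it to a worker process, and
        # collects the results -- that round trip is the overhead
        results = pool.map(count_primes, chunks)
    print(sum(results))  # total primes up to 100,000
```

Note the `if __name__ == "__main__":` guard: worker processes re-import the module, and without the guard they would try to spawn workers of their own.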
As an aside, Python modules and packages that use C libraries (such as NumPy or Cython) are able to avoid the GIL entirely. That’s another reason they’re recommended for a speed boost.
Profile Your Code
You can’t fix what you don’t measure, as the old adage goes. Likewise, you can’t find out why any given Python application runs suboptimally without finding out where the slowness actually resides.
Start with simple profiling by way of Python’s built-in cProfile module, and move to a more powerful profiler if you need greater precision or greater depth of insight. Often, the insights gleaned by basic function-level inspection of an application provide more than enough perspective. (You can pull profile data for a single function call via the cProfile.run function.)
Why a particular part of the app is so slow, and how to fix it, may take more digging. The point is to narrow the focus, establish a baseline with hard numbers, and test across a variety of usage and deployment scenarios whenever possible. Don’t optimize prematurely. Guessing gets you nowhere.
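A minimal sketch of function-level profiling with the standard library (the workload function is a hypothetical example; repeated string concatenation is just a convenient hot spot to observe):

```python
import cProfile
import pstats

def build_report(n):
    # Deliberately naive: builds a string by repeated concatenation
    out = ""
    for i in range(n):
        out += str(i)
    return out

profiler = cProfile.Profile()
profiler.enable()
report = build_report(10_000)
profiler.disable()

# Print the five entries with the highest cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)
```

The sorted output points you at the functions where time actually accumulates, which is the hard-numbers baseline the paragraph above calls for.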
Run with PyPy
CPython, the most commonly used implementation of Python, prioritizes compatibility over raw speed. For programmers who want to put speed first, there’s PyPy, a Python implementation outfitted with a JIT compiler to accelerate code execution.
Because PyPy was designed as a drop-in replacement for CPython, it’s one of the simplest ways to get a quick performance boost. Many common Python applications will run on PyPy exactly as they are. Generally, the more the app relies on “vanilla” Python, the more likely it will run on PyPy without modification.
However, taking best advantage of PyPy may require testing and study. You’ll find that long-running apps derive the biggest performance gains from PyPy, because the compiler analyzes the execution over time. For short scripts that run and exit, you’re probably better off using CPython, since the performance gains won’t be sufficient to overcome the overhead of the JIT.
Note that PyPy’s support for the latest version of Python typically lags CPython by a version or two, so code that uses the newest language features may not work until PyPy catches up. Finally, Python apps that use ctypes may not always behave as expected. If you’re writing something that might run on both PyPy and CPython, it might make sense to handle use cases separately for each interpreter.
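When you do need interpreter-specific code paths, you can branch on the running implementation. A minimal sketch, where the two backend names are hypothetical placeholders for whatever per-interpreter strategy your app uses:

```python
import platform

def interpreter_name():
    # Returns "CPython", "PyPy", etc. for the running interpreter
    return platform.python_implementation()

def pick_backend():
    # Hypothetical policy: under PyPy's JIT, plain-Python loops are often
    # fast enough; under CPython you might prefer a C-backed library
    if interpreter_name() == "PyPy":
        return "pure-python"
    return "c-extension"
```

Keeping the branch in one place like this makes it easy to audit which parts of the app diverge between interpreters.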