This article is light on text and heavy on code: 20 practical tips for improving Python performance and saying goodbye to slow Python. The original author, Kaiyuan, is a full-stack programmer who works with Python, Java, PHP, and C++.
1. Optimize algorithm time complexity
The time complexity of an algorithm has the greatest impact on the execution efficiency of a program. In Python you can reduce time complexity by choosing the right data structure: for example, looking up an element in a list is O(n), while in a set it is O(1). Different scenarios call for different optimizations; in general, the common strategies are divide and conquer, branch and bound, greedy algorithms, and dynamic programming.
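As a sketch of how much the algorithmic choice matters: memoization (a form of dynamic programming) turns the exponential-time naive Fibonacci recursion into a linear-time one. This Python 3 example uses functools.lru_cache; the function name is illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # The cache stores each fib(k) after its first computation, so the
    # naive O(2**n) recursion collapses to O(n) distinct calls.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(30))  # returns instantly; the uncached version makes ~2.7 million calls
```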
2. Reduce redundant data
For example, store only the upper or lower triangle of a large symmetric matrix, and use a sparse representation for matrices whose elements are mostly zeros.
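A minimal sketch of the triangular-storage idea: the class below (an illustrative name, not from the original article) keeps only the upper triangle of an n x n symmetric matrix in a flat list, using n*(n+1)/2 slots instead of n*n.

```python
class SymMatrix:
    """Store only the upper triangle of an n x n symmetric matrix."""

    def __init__(self, n):
        self.n = n
        # n*(n+1)//2 slots instead of n*n
        self.data = [0] * (n * (n + 1) // 2)

    def _index(self, i, j):
        if i > j:                      # map (i, j) and (j, i) to the same slot
            i, j = j, i
        # row i starts after rows of lengths n, n-1, ..., n-i+1
        return i * self.n - i * (i - 1) // 2 + (j - i)

    def get(self, i, j):
        return self.data[self._index(i, j)]

    def set(self, i, j, value):
        self.data[self._index(i, j)] = value

m = SymMatrix(3)
m.set(0, 2, 7)
print(m.get(2, 0))   # 7 -- the symmetric entry shares the same slot
print(len(m.data))   # 6 instead of 9
```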
3. Use copy and deepcopy appropriately
For data structures such as dict and list, direct assignment only copies a reference. When you need to copy the whole object, use copy and deepcopy from the copy package; the difference between the two functions is that deepcopy copies recursively. Their efficiency also differs (the following programs run in ipython):
import copy
a = range(100000)
%timeit -n 10 copy.copy(a) # run copy.copy(a) 10 times
%timeit -n 10 copy.deepcopy(a)
10 loops, best of 3: 1.55 ms per loop
10 loops, best of 3: 151 ms per loop
The -n after timeit specifies the number of runs, and the last two lines are the output of the two timeit calls. As they show, deepcopy is two orders of magnitude slower than copy.
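The reason deepcopy costs so much more is that it recurses into nested objects. A minimal demonstration of the semantic difference:

```python
import copy

a = [[1, 2], [3, 4]]

shallow = copy.copy(a)      # new outer list, but the inner lists are shared
deep = copy.deepcopy(a)     # recursively copies the inner lists too

a[0][0] = 99
print(shallow[0][0])  # 99 -> the inner list is shared with a
print(deep[0][0])     # 1  -> fully independent copy
```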
4. Use dict or set to look up elements
Python dict and set are both implemented using hash tables (similar to unordered_map in the c++11 standard library), and the time complexity of looking up elements is O(1).
a = range(1000)
s = set(a)
d = dict((i, 1) for i in a)
%timeit -n 10000 100 in d
%timeit -n 10000 100 in s
10000 loops, best of 3: 43.5 ns per loop
10000 loops, best of 3: 49.6 ns per loop
dict is slightly more efficient (and takes up a bit more space).
5. Use generators and yield appropriately
%timeit -n 100 a = (i for i in range(100000))
%timeit -n 100 b = [i for i in range(100000)]
100 loops, best of 3: 1.54 ms per loop
100 loops, best of 3: 4.56 ms per loop
Using () yields a generator object that requires memory space independent of the size of the list, so it will be more efficient. In specific applications, for example, set(i for i in range(100000)) will be faster than set([i for i in range(100000)]).
But for cases where loop traversal is required:
%timeit -n 10 for x in (i for i in range(100000)): pass
%timeit -n 10 for x in [i for i in range(100000)]: pass
10 loops, best of 3: 6.51 ms per loop
10 loops, best of 3: 5.54 ms per loop
The latter (the list) is more efficient for a full traversal, but if the loop may break early, the generator's advantage is obvious. yield can also be used to create a generator:
def yield_func(ls):
    for i in ls:
        yield i + 1

def not_yield_func(ls):
    return [i + 1 for i in ls]

ls = range(1000000)
%timeit -n 10 for i in yield_func(ls): pass
%timeit -n 10 for i in not_yield_func(ls): pass
10 loops, best of 3: 63.8 ms per loop
10 loops, best of 3: 62.9 ms per loop
For a list that fits comfortably in memory, you can return the list directly, but the yield version is more readable (personal preference).
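The early-exit advantage of generators can be sketched as follows (Python 3; first_over is an illustrative name). Because squares are produced lazily, iteration stops as soon as we return, and only a handful of values are ever computed, no matter how large n is.

```python
def first_over(limit, n):
    """Return the first square over `limit` among 0..n-1."""
    # The generator expression produces squares one at a time, so
    # returning early means the remaining ~n squares are never computed.
    for sq in (i * i for i in range(n)):
        if sq > limit:
            return sq
    return None

# A list comprehension here would build all 10**6 squares up front;
# the generator computes only 12 of them before exiting.
print(first_over(100, 10**6))  # 121
```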
Built-in generator facilities include the xrange function (Python 2), the itertools package, and others.
6. Optimize loops
Whatever can be computed outside a loop should not be put inside it. For example, the following optimization makes the loop about twice as fast:
a = range(10000)
size_a = len(a)
%timeit -n 1000 for i in a: k = len(a)
%timeit -n 1000 for i in a: k = size_a
1000 loops, best of 3: 569 µs per loop
1000 loops, best of 3: 256 µs per loop
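Another loop-invariant worth hoisting is the attribute lookup itself: binding a bound method like list.append to a local name once, instead of resolving result.append on every iteration. A sketch (the function name is illustrative):

```python
def collect_squares(n):
    result = []
    append = result.append   # hoist the attribute lookup out of the loop
    for i in range(n):
        append(i * i)        # local-name call, no per-iteration lookup
    return result

print(collect_squares(5))  # [0, 1, 4, 9, 16]
```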
7. Optimize the order of conditions in compound expressions
For and, put the condition that is satisfied less often first; for or, put the condition that is satisfied more often first. For example:
a = range(2000)
%timeit -n 100 [i for i in a if 10 < i < 20 or 1000 < i < 2000]
%timeit -n 100 [i for i in a if 1000 < i < 2000 or 10 < i < 20]
%timeit -n 100 [i for i in a if i % 2 == 0 and i > 1900]
%timeit -n 100 [i for i in a if i > 1900 and i % 2 == 0]
100 loops, best of 3: 287 µs per loop
100 loops, best of 3: 214 µs per loop
100 loops, best of 3: 128 µs per loop
100 loops, best of 3: 56.1 µs per loop
8. Use join to merge strings in an iterator
In [1]: %%timeit
   ...: s = ''
   ...: for i in a:
   ...:     s += i
   ...:
10000 loops, best of 3: 59.8 µs per loop

In [2]: %%timeit
   ...: s = ''.join(a)
   ...:
100000 loops, best of 3: 11.8 µs per loop
join gives roughly a 5x speedup over cumulative concatenation.
9. Choose the right string formatting method
s1, s2 = 'ax', 'bx'
%timeit -n 100000 'abc%s%s' % (s1, s2)
%timeit -n 100000 'abc{0}{1}'.format(s1, s2)
%timeit -n 100000 'abc' + s1 + s2
100000 loops, best of 3: 183 ns per loop
100000 loops, best of 3: 169 ns per loop
100000 loops, best of 3: 103 ns per loop
The % style is the slowest of the three, but the differences are small (all three are very fast). (Personally, I find % the most readable.)
10. Swap the values of two variables without an intermediate variable
In [3]: %%timeit -n 10000 a, b = 1, 2
   ...: c = a; a = b; b = c
   ...:
10000 loops, best of 3: 172 ns per loop

In [4]: %%timeit -n 10000 a, b = 1, 2
   ...: a, b = b, a
   ...:
10000 loops, best of 3: 86 ns per loop
Using a, b = b, a instead of c = a; a = b; b = c to swap a and b is about twice as fast.
11. Use if ... is
a = range(10000)
%timeit -n 100 [i for i in a if i == True]
%timeit -n 100 [i for i in a if i is True]
100 loops, best of 3: 531 µs per loop
100 loops, best of 3: 362 µs per loop
Using if i is True is nearly twice as fast as if i == True. (Note, however, that the two are not equivalent: 1 == True is true but 1 is True is false, so switching to is can change the result as well as the speed.)
12. Using cascade comparisons x < y < z
x, y, z = 1, 2, 3
%timeit -n 1000000 if x < y < z: pass
%timeit -n 1000000 if x < y and y < z: pass
1000000 loops, best of 3: 101 ns per loop
1000000 loops, best of 3: 121 ns per loop
x < y < z is slightly more efficient and more readable.
13. while 1 is faster than while True
def while_1():
    n = 100000
    while 1:
        n -= 1
        if n <= 0:
            break

def while_true():
    n = 100000
    while True:
        n -= 1
        if n <= 0:
            break

m, n = 1000000, 1000000
%timeit -n 100 while_1()
%timeit -n 100 while_true()
100 loops, best of 3: 3.69 ms per loop
100 loops, best of 3: 5.61 ms per loop
The reason while 1 is so much faster than while True is that in Python 2, True is a global variable, not a keyword, so it must be looked up on every iteration. (In Python 3, True is a keyword and the difference disappears.)
14. Use ** instead of pow
%timeit -n 10000 c = pow(2, 20)
%timeit -n 10000 c = 2 ** 20
10000 loops, best of 3: 284 ns per loop
10000 loops, best of 3: 16.9 ns per loop
** is more than 10 times faster!
15. Use cProfile, cStringIO, and cPickle, the packages implemented in C with the same functionality (corresponding to profile, StringIO, and pickle, respectively)
import cPickle
import pickle
a = range(10000)
%timeit -n 100 x = cPickle.dumps(a)
%timeit -n 100 x = pickle.dumps(a)
100 loops, best of 3: 1.58 ms per loop
100 loops, best of 3: 17 ms per loop
The packages implemented in C are more than 10 times faster!
16. Use the best deserialization method
The following compares the efficiency of eval, cPickle, and json in deserializing the corresponding strings:
import json
import cPickle
a = range(10000)
s1 = str(a)
s2 = cPickle.dumps(a)
s3 = json.dumps(a)
%timeit -n 100 x = eval(s1)
%timeit -n 100 x = cPickle.loads(s2)
%timeit -n 100 x = json.loads(s3)
100 loops, best of 3: 16.8 ms per loop
100 loops, best of 3: 2.02 ms per loop
100 loops, best of 3: 798 µs per loop
It can be seen that json is nearly 3 times faster than cPickle and more than 20 times faster than eval.
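The same comparison can be reproduced in Python 3 (where the cPickle C implementation is used automatically by the pickle module) with the standard timeit module instead of ipython magics. This is a methodology sketch; the measured numbers will vary by machine.

```python
import json
import pickle
import timeit

a = list(range(10000))
s_pickle = pickle.dumps(a)
s_json = json.dumps(a)

# Verify both round-trips recover the original list
assert pickle.loads(s_pickle) == a
assert json.loads(s_json) == a

# Time 100 deserializations of each (results are machine-dependent)
t_pickle = timeit.timeit(lambda: pickle.loads(s_pickle), number=100)
t_json = timeit.timeit(lambda: json.loads(s_json), number=100)
print("pickle: %.4fs  json: %.4fs" % (t_pickle, t_json))
```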
17. Use C extensions
There are currently four main approaches: the native API of CPython (the most common Python implementation), ctypes, Cython, and cffi. All of them let a Python program call dynamic-link libraries compiled from C. Their characteristics are:
CPython native API: by including the Python header files, the corresponding C program can manipulate Python data structures directly. The implementation is relatively tedious, but it has a relatively wide scope of application.
ctypes: typically used to wrap C code, allowing pure Python programs to call functions in dynamic-link libraries (dll files on Windows, so files on Unix). If you want to use an existing C library from Python, ctypes is a good choice; in some benchmarks, Python 2 + ctypes is the best-performing approach.
Cython: a superset of Python that simplifies the process of writing C extensions. Its advantages are a concise syntax and good compatibility with libraries such as numpy that contain a large number of C extensions. It is generally used to optimize a particular algorithm or process within a project; in some tests it can deliver a performance improvement of hundreds of times.
cffi: the counterpart of ctypes in pypy (see below), also compatible with CPython. It provides a way to use C libraries from Python, allows writing C code directly inside Python code, and supports linking to existing C libraries.
These optimization methods are generally aimed at modules that are performance bottlenecks in an existing project; with only a small number of changes to the original project, they can significantly improve the running efficiency of the whole program.
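As a minimal ctypes sketch, the snippet below calls sqrt from the C math library directly. This assumes a Unix-like system where ctypes.util.find_library can locate the library; the library name and path are platform-dependent.

```python
import ctypes
import ctypes.util

# Locate and load the C math library (platform-dependent; this is a
# sketch assuming a Unix-like system).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double sqrt(double). Without this, ctypes
# would assume an int return value and garble the result.
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(2.0))  # calls the compiled C function directly
```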
18. Parallel programming
Because of the GIL, it is difficult for Python to take full advantage of multi-core CPUs. However, the following parallel modes can be implemented with the built-in multiprocessing module:
Multi-process: for CPU-intensive programs, the Process, Pool, and other classes encapsulated by multiprocessing enable parallel computation across multiple processes. However, because inter-process communication is relatively expensive, programs that require a large amount of data exchange between processes may not see much improvement.
Multi-threading: for IO-intensive programs, the multiprocessing.dummy module wraps threading with the multiprocessing interface, which makes multi-threaded programming very easy (for example, you can use Pool's map interface, which is concise and efficient).
Distributed:The Managers class in multiprocessing provides a way to share data among different processes, allowing the development of distributed programs on this basis.
Different business scenarios can choose one or a combination of several of them to achieve optimization of program performance.
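A minimal sketch of the thread-pool style described above, using multiprocessing.dummy (thread-based workers behind the multiprocessing Pool interface). The task function is an illustrative stand-in for a real IO-bound job such as a network request.

```python
from multiprocessing.dummy import Pool as ThreadPool

def task(x):
    # Stand-in for an IO-bound job (e.g. fetching a URL); for real
    # CPU-bound work, use multiprocessing.Pool instead to bypass the GIL.
    return x * x

pool = ThreadPool(4)                  # 4 worker threads
results = pool.map(task, range(10))   # same concise map interface as Pool
pool.close()
pool.join()
print(results)
```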
19. The ultimate killer: PyPy
PyPy is Python implemented in RPython (a restricted subset of Python); according to the benchmark data on the official website, it is more than 6 times faster than CPython. The speed comes from its Just-in-Time (JIT) compiler, i.e. a dynamic compiler, which differs from static compilers (such as gcc or javac) in that it optimizes using data gathered while the program runs. For historical reasons, PyPy currently retains the GIL, although the ongoing STM project is attempting to turn PyPy into Python without a GIL.
If a Python program contains C extensions (written in a non-cffi way), the JIT's optimizations become much less effective, and the program can even run slower than under CPython (e.g. with Numpy). So in PyPy it is better to use pure Python or cffi extensions.
With the improvement of STM, Numpy and other projects, I believe PyPy will replace CPython.
20. Use of performance analysis tools
In addition to the timeit module used in ipython above, there is cProfile. cProfile is also very simple to use: run python -m cProfile filename.py, where filename.py is the program file to run. The standard output then shows how many times each function was called and how long it took, so you can find the program's performance bottlenecks and optimize them in a targeted way.
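Besides the command line, cProfile can also be driven programmatically to profile just one section of a program. A Python 3 sketch (hot_function is an illustrative stand-in for a real workload):

```python
import cProfile
import io
import pstats

def hot_function():
    # Deliberately slow stand-in for a real workload
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
hot_function()
profiler.disable()

# Print the 5 most expensive entries, sorted by cumulative time
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```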
This is the whole content of this article.