I have a function defined by a combination of basic math functions (abs, cosh, sinh, exp, ...).
I was wondering if it makes a difference (in speed) to use, for example,
numpy.abs() instead of abs()?
Here are the timing results:
lebigot@weinberg ~ % python -m timeit 'abs(3.15)'
10000000 loops, best of 3: 0.146 usec per loop
lebigot@weinberg ~ % python -m timeit -s 'from numpy import abs as nabs' 'nabs(3.15)'
100000 loops, best of 3: 3.92 usec per loop
numpy.abs() is slower than abs() because it also handles Numpy arrays: it contains additional code that provides this flexibility.
However, Numpy is fast on arrays:
lebigot@weinberg ~ % python -m timeit -s 'a = [3.15]*1000' '[abs(x) for x in a]'
10000 loops, best of 3: 186 usec per loop
lebigot@weinberg ~ % python -m timeit -s 'import numpy; a = numpy.empty(1000); a.fill(3.15)' 'numpy.abs(a)'
100000 loops, best of 3: 6.47 usec per loop
(PS: in Python 2.7, '[abs(x) for x in a]' is slower than map(abs, a), which is about 30% faster but still much slower than NumPy.)
Thus, numpy.abs() does not take much more time for 1000 elements than for 1 single float!
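To make the "flat per-call overhead" point concrete, here is a minimal sketch (not from the original answer; the array sizes are arbitrary) that times np.abs on arrays of increasing size:

import timeit
import numpy as np

# The per-call overhead dominates for small inputs, so the timings barely grow at first
for n in (1, 1000, 100000):
    a = np.full(n, 3.15)
    t = timeit.timeit(lambda: np.abs(a), number=10000)
    print(f"np.abs on {n:>6} elements: {t:.4f} s for 10000 calls")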
You should use NumPy functions to deal with NumPy types and regular Python functions to deal with regular Python types.
The worst performance usually occurs when mixing Python builtins with NumPy because of type conversions. Those conversions have been optimized lately, but it is still often better to avoid them. Of course, your mileage may vary, so use profiling tools to find out.
Also consider using tools like Cython or writing a C module if you want to optimize your program further, or consider not using Python at all when performance matters.
But once your data has been put into a NumPy array, NumPy can be really fast at computing on a whole bunch of data.
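Following the profiling suggestion above, here is a minimal sketch using the standard cProfile module (the mixed() function is made up purely for illustration):

import cProfile
import numpy as np

def mixed(a):
    # deliberately mixes a Python-level loop with a vectorized NumPy call
    return sum(abs(x) for x in a) + np.abs(a).sum()

a = np.random.standard_normal(100000)
cProfile.run("mixed(a)", sort="cumulative")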
In fact, on a NumPy array the built-in abs calls NumPy's implementation via __abs__ (see Why built-in functions like abs works on numpy array?), so in theory there shouldn't be much performance difference.
import timeit
import numpy as np

x = np.random.standard_normal(10000)

def pure_abs():
    # built-in abs dispatches to ndarray.__abs__ for NumPy arrays
    return abs(x)

def numpy_abs():
    return np.abs(x)

n = 10000
t1 = timeit.timeit(pure_abs, number=n)
print('Pure Python abs:', t1)
t2 = timeit.timeit(numpy_abs, number=n)
print('Numpy abs:', t2)
Pure Python abs: 0.435754060745
Numpy abs: 0.426516056061
Recently, I had to convert the values of a dictionary to a list in Python 3.6, for a use case where this is supposed to happen a lot.
Trying to be a good guy, I wanted to use a solution close to the PEP. Now, PEP 3106 suggests
list(d.keys())
which obviously works fine - but using timeit on my Windows 7 machine I see
>python -m timeit "[*{'a': 1, 'b': 2}.values()]"
1000000 loops, best of 3: 0.249 usec per loop
>python -m timeit "list({'a': 1, 'b': 2}.values())"
1000000 loops, best of 3: 0.362 usec per loop
I assume that there is an advantage to the latter version, because why else would the PEP suggest the slower one?
So here comes my question: What's the advantage of the latter version compared to the first one?
The answer is that the faster syntax was first introduced in PEP 448 in 2013, while the PEP 3106 you reference was written in 2006, so even if it is faster in a real way now, it didn't exist when that PEP was written.
As noted by others, the role of PEPs is not to provide a template for the fastest possible code. In general, code in PEPs aims to be as simple and clear as possible, because examples are about understanding concepts rather than achieving the best possible results. So even if the syntax had existed at the time, and were faster in a real (and reliable) way, it still might not have been used.
A bit of further testing with larger values:
python -m timeit -s "x = [1]*10000" "[*x]"
10000 loops, best of 3: 44.6 usec per loop
python -m timeit -s "x = [1]*10000" "list(x)"
10000 loops, best of 3: 44.8 usec per loop
This shows the difference isn't really a factor of two, but rather a flat cost - I would guess it's the cost of looking up the list() built-in name. This is negligible in most real cases.
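One way to check the name-lookup guess is to compare the bytecode of the two forms with the dis module (a quick sketch; the exact opcodes vary across Python versions):

import dis

# list(x) needs a name lookup plus a function call;
# [*x] compiles directly to list-building opcodes.
dis.dis(compile("list(x)", "<demo>", "eval"))
dis.dis(compile("[*x]", "<demo>", "eval"))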
Since an assignment problem can be posed in the form of a single matrix, I am wondering if NumPy has a function to solve such a matrix. So far I have found none. Maybe one of you guys know if NumPy/SciPy has an assignment-problem-solve function?
Edit: In the meantime I have found a Python (not NumPy/SciPy) implementation at http://software.clapper.org/munkres/. Still, I suppose a NumPy/SciPy implementation could be much faster, right?
There is now a NumPy implementation of the Munkres algorithm in scikit-learn under sklearn/utils/linear_assignment_.py; its only dependency is NumPy. I tried it with some approximately 20x20 matrices, and it seems to be about 4 times as fast as the one linked to in the question. cProfile shows 2.517 seconds vs 9.821 seconds for 100 iterations.
I was hoping that the newer scipy.optimize.linear_sum_assignment would be fastest, but (perhaps not surprisingly) the Cython library (which does not have pip support) is significantly faster, at least for my use case:
UPDATE: using munkres v1.1.2 and scipy v1.5.0 achieves the following results:
$ python -m timeit -s "from scipy.optimize import linear_sum_assignment; import numpy as np; np.random.seed(0); c = np.random.rand(20,30)" "a,b = linear_sum_assignment(c)"
10000 loops, best of 5: 32.8 usec per loop
$ python -m timeit -s "from munkres import Munkres; import numpy as np; np.random.seed(0); c = np.random.rand(20,30); m = Munkres()" "a = m.compute(c)"
100 loops, best of 5: 2.41 msec per loop
$ python -m timeit -s "from scipy.optimize import linear_sum_assignment; import numpy as np; np.random.seed(0);" "c = np.random.rand(20,30); a,b = linear_sum_assignment(c)"
5000 loops, best of 5: 51.7 usec per loop
$ python -m timeit -s "from munkres import Munkres; import numpy as np; np.random.seed(0)" "c = np.random.rand(20,30); m = Munkres(); a = m.compute(c)"
10 loops, best of : 26 msec per loop
No, NumPy contains no such function. Combinatorial optimization is outside of NumPy's scope. It may be possible to do it with one of the optimizers in scipy.optimize but I have a feeling that the constraints may not be of the right form.
NetworkX probably also includes algorithms for assignment problems.
Yet another fast implementation, as already hinted by @Matthew: scipy.optimize has a function called linear_sum_assignment. From the docs:
The method used is the Hungarian algorithm, also known as the Munkres or Kuhn-Munkres algorithm.
https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.optimize.linear_sum_assignment.html
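For reference, a minimal usage sketch (the cost matrix is made up):

import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])
row_ind, col_ind = linear_sum_assignment(cost)
print(list(zip(row_ind, col_ind)))   # optimal row-to-column assignment
print(cost[row_ind, col_ind].sum())  # total cost of that assignment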
There is an implementation of the Munkres algorithm as a Python extension module which has NumPy support. I've used it successfully on my old laptop. However, it does not work on my new machine - I assume there is a problem with "new" NumPy versions (or the 64-bit architecture).
As of version 2.4 (released 2019-10-16), NetworkX solves the problem through nx.algorithms.bipartite.minimum_weight_full_matching. At the time of writing, the implementation uses SciPy's scipy.optimize.linear_sum_assignment under the hood, so expect the same performance characteristics.
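A minimal sketch of what the NetworkX call looks like (the tiny graph here is invented; check the NetworkX documentation for the exact signature in your version):

import networkx as nx

# Small bipartite graph: workers "w0", "w1" vs. jobs 0, 1; edge weights are costs
G = nx.Graph()
G.add_weighted_edges_from([("w0", 0, 4), ("w0", 1, 1),
                           ("w1", 0, 2), ("w1", 1, 3)])
matching = nx.algorithms.bipartite.minimum_weight_full_matching(
    G, top_nodes=["w0", "w1"])
print(matching)  # dict mapping each matched node to its partner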
In addition to the solver in scipy.optimize.linear_sum_assignment already mentioned in some of the other answers, SciPy (as of 1.6.0) also comes with a sparsity-friendly solver in scipy.sparse.csgraph.min_weight_full_bipartite_matching.
In [2]: from scipy.sparse import random
In [3]: from scipy.sparse.csgraph import min_weight_full_bipartite_matching
In [4]: from scipy.optimize import linear_sum_assignment
In [15]: sparse = random(1000, 1000, density=0.01, format='csr')
In [16]: %timeit min_weight_full_bipartite_matching(sparse)
3.84 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [17]: dense = sparse.toarray()
In [18]: %timeit linear_sum_assignment(dense)
18.8 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I've found that max is slower than the sort function in Python 2 and 3.
Python 2
$ python -m timeit -s 'import random;a=range(10000);random.shuffle(a)' 'a.sort();a[-1]'
1000 loops, best of 3: 239 usec per loop
$ python -m timeit -s 'import random;a=range(10000);random.shuffle(a)' 'max(a)'
1000 loops, best of 3: 342 usec per loop
Python 3
$ python3 -m timeit -s 'import random;a=list(range(10000));random.shuffle(a)' 'a.sort();a[-1]'
1000 loops, best of 3: 252 usec per loop
$ python3 -m timeit -s 'import random;a=list(range(10000));random.shuffle(a)' 'max(a)'
1000 loops, best of 3: 371 usec per loop
Why is max (O(n)) slower than the sort function (O(n log n))?
You have to be very careful when using the timeit module in Python.
python -m timeit -s 'import random;a=range(10000);random.shuffle(a)' 'a.sort();a[-1]'
Here the initialisation code runs once to produce a randomised array a. Then the rest of the code is run several times. The first time it sorts the array, but every other time you are calling the sort method on an already sorted array. Only the fastest time is returned, so you are actually timing how long it takes Python to sort an already sorted array.
Part of Python's sort algorithm is to detect when the array is already partly or completely sorted. When completely sorted it simply has to scan once through the array to detect this and then it stops.
If instead you tried:
python -m timeit -s 'import random;a=range(100000);random.shuffle(a)' 'sorted(a)[-1]'
then the sort happens on every timing loop and you can see that the time for sorting an array is indeed much longer than to just find the maximum value.
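The same effect can be reproduced without shell one-liners; a minimal sketch (numbers will differ by machine):

import random
import timeit

a = list(range(10000))
random.shuffle(a)

# copy first, so every run sorts the same shuffled data
t_shuffled = timeit.timeit("b = a[:]; b.sort()", globals={"a": a}, number=1000)

a.sort()  # the list is now already sorted
t_presorted = timeit.timeit("b = a[:]; b.sort()", globals={"a": a}, number=1000)

print("sort a shuffled copy:  ", t_shuffled)
print("sort a pre-sorted copy:", t_presorted)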
Edit: @skyking's answer explains the part I left unexplained: a.sort() knows it is working on a list, so it can directly access the elements. max(a) works on any arbitrary iterable, so it has to use generic iteration.
First off, note that max() uses the iterator protocol, while list.sort() uses ad-hoc code. Clearly, using an iterator adds significant overhead, which is why you are observing that difference in timings.
However, apart from that, your tests are not fair. You are running a.sort() on the same list more than once. The algorithm used by Python is specifically designed to be fast for already (partially) sorted data. Your tests are saying that the algorithm is doing its job well.
These are fair tests:
$ python3 -m timeit -s 'import random;a=list(range(10000));random.shuffle(a)' 'max(a[:])'
1000 loops, best of 3: 227 usec per loop
$ python3 -m timeit -s 'import random;a=list(range(10000));random.shuffle(a)' 'a[:].sort()'
100 loops, best of 3: 2.28 msec per loop
Here I'm creating a copy of the list every time. As you can see, the order of magnitude of the results is different: micro- vs. milliseconds, as we would expect.
And remember: big-Oh specifies an upper bound! The lower bound for Python's sorting algorithm is Ω(n). Being O(n log n) does not automatically imply that every run takes a time proportional to n log n. It does not even imply that it needs to be slower than an O(n) algorithm, but that's another story. What's important to understand is that in some favorable cases, an O(n log n) algorithm may run in O(n) time or less.
This could be because l.sort is a member of list while max is a generic function. This means that l.sort can rely on the internal representation of list, while max has to go through the generic iterator protocol.
As a result, each element fetch for l.sort is faster than each element fetch that max performs.
I assume that if you instead use sorted(a), the result will be slower than max(a).
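A quick sketch to check that guess (timings will vary):

import random
import timeit

a = list(range(10000))
random.shuffle(a)

print("max(a):       ", timeit.timeit("max(a)", globals={"a": a}, number=1000))
print("sorted(a)[-1]:", timeit.timeit("sorted(a)[-1]", globals={"a": a}, number=1000))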
I ran some simple tests on the abs() and fabs() functions and I don't understand the advantages of using fabs(), if it:
1) is slower
2) works only on floats
3) will throw an exception if used on a different type
In [1]: %timeit abs(5)
10000000 loops, best of 3: 86.5 ns per loop
In [3]: %timeit fabs(5)
10000000 loops, best of 3: 115 ns per loop
In [4]: %timeit abs(-5)
10000000 loops, best of 3: 88.3 ns per loop
In [5]: %timeit fabs(-5)
10000000 loops, best of 3: 114 ns per loop
In [6]: %timeit abs(5.0)
10000000 loops, best of 3: 92.5 ns per loop
In [7]: %timeit fabs(5.0)
10000000 loops, best of 3: 93.2 ns per loop
it's even slower on floats!
From where I am standing, the only advantage of using fabs() is to make your code more readable, because by using it you are clearly stating your intention of working with floating-point values.
Is there any other use of fabs()?
From an email response from Tim Peters:
Why does math have an fabs function? Both it and the abs builtin function
wind up calling fabs() for floats. abs() is faster to boot.
Nothing deep -- the math module supplies everything in C89's standard
libm (+ a few extensions), fabs() is a std C89 libm function.
There isn't a clear (to me) reason why one would be faster than the
other; sounds accidental; math.fabs() could certainly be made faster
(as currently implemented (via math_1), it endures a pile of
general-purpose "try to guess whether libm should have set errno"
boilerplate that's wasted (there are no domain or range errors
possible for fabs())).
It seems there is no advantageous reason to use fabs. Just use abs for virtually all purposes.
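To make the difference concrete, a small sketch (my own illustration, not from the quoted email): abs preserves the input type and works on anything that defines __abs__, while math.fabs always returns a float and rejects complex numbers:

from math import fabs
from fractions import Fraction

print(abs(-5), type(abs(-5)))    # 5 <class 'int'> -- type preserved
print(fabs(-5), type(fabs(-5)))  # 5.0 <class 'float'> -- always a float
print(abs(3 + 4j))               # 5.0 -- abs of a complex number is its magnitude
print(abs(Fraction(-1, 3)))      # 1/3 -- works via Fraction.__abs__
try:
    fabs(3 + 4j)
except TypeError as exc:
    print("fabs rejects complex:", exc)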
I personally had an issue with my GCC compiler in C++: when using abs, it always returns an integer and not a double, even if the result is a double. It was really a big issue for me at the time, because it did not occur to me that abs could be the problem (it is not obvious and easy to think that way). But I accidentally tried using fabs and the issue was solved; now my program runs perfectly.