In Python, theoretically, which of the two methods, test1 or test2, should be faster (assuming the same value of x)? I have tried using %timeit but see very little difference.
import numpy as np
class Tester():
    def __init__(self):
        self.x = np.arange(100000)

    def test1(self):
        return np.sum(self.x * self.x)

    def test2(self, x):
        return np.sum(x * x)
In any implementation of Python, the time will be overwhelmingly dominated by the multiplication of two vectors with 100,000 elements each. Everything else is noise compared to that. Make the vector much smaller if you're really interested in measuring other overheads.
In CPython, test2() will most likely be a little faster. It has an "extra" argument, but arguments are unpacked "at C speed" so that doesn't matter much. Arguments are accessed the same way as local variables, via the LOAD_FAST opcode, which is a simple array[index] access.
In test1(), each instance of self.x causes the string "x" to be looked up in the dictionary self.__dict__. That's slower than an indexed array access. But compared to the time taken by the long-winded multiplication, it's basically nothing.
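To see that difference directly, you can disassemble the two methods: the self.x access compiles to a LOAD_FAST of self followed by a LOAD_ATTR, while the argument is a bare LOAD_FAST (exact opcode names vary a little across CPython versions). A minimal sketch:

import dis
import numpy as np

class Tester():
    def __init__(self):
        self.x = np.arange(100000)

    def test1(self):
        return np.sum(self.x * self.x)

    def test2(self, x):
        return np.sum(x * x)

dis.dis(Tester.test1)   # each self.x -> LOAD_FAST (self) then LOAD_ATTR (x)
dis.dis(Tester.test2)   # the argument x -> a plain LOAD_FAST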
I know this sort of misses the point of the question, but since you tagged the question with numpy and are looking at speed differences for a large array, I thought I would mention that the faster solution would be something else entirely.
So, what you're doing is a dot product, so use numpy.dot, which does the multiplication and summation together in a single call backed by an optimized external library (typically BLAS). (For convenience I'll use the syntax of test1, despite @Tim's answer, because no extra argument needs to be passed.)
def test3(self):
    return np.dot(self.x, self.x)
or possibly even faster (and certainly more general):
def test4(self):
    return np.einsum('i,i->', self.x, self.x)
Here are some tests:
In [363]: paste
class Tester():
    def __init__(self, n):
        self.x = np.arange(n)

    def test1(self):
        return np.sum(self.x * self.x)

    def test2(self, x):
        return np.sum(x * x)

    def test3(self):
        return np.dot(self.x, self.x)

    def test4(self):
        return np.einsum('i,i->', self.x, self.x)
## -- End pasted text --
In [364]: t = Tester(10000)
In [365]: np.allclose(t.test1(), [t.test2(t.x), t.test3(), t.test4()])
Out[365]: True
In [366]: timeit t.test1()
10000 loops, best of 3: 37.4 µs per loop
In [367]: timeit t.test2(t.x)
10000 loops, best of 3: 37.4 µs per loop
In [368]: timeit t.test3()
100000 loops, best of 3: 15.2 µs per loop
In [369]: timeit t.test4()
100000 loops, best of 3: 16.5 µs per loop
In [370]: t = Tester(10)
In [371]: timeit t.test1()
100000 loops, best of 3: 16.6 µs per loop
In [372]: timeit t.test2(t.x)
100000 loops, best of 3: 16.5 µs per loop
In [373]: timeit t.test3()
100000 loops, best of 3: 3.14 µs per loop
In [374]: timeit t.test4()
100000 loops, best of 3: 6.26 µs per loop
And speaking of small, almost syntactic, speed differences, consider using the sum method rather than the standalone np.sum function:
def test1b(self):
    return (self.x * self.x).sum()
gives:
In [385]: t = Tester(10000)
In [386]: timeit t.test1()
10000 loops, best of 3: 40.6 µs per loop
In [387]: timeit t.test1b()
10000 loops, best of 3: 37.3 µs per loop
In [388]: t = Tester(3)
In [389]: timeit t.test1()
100000 loops, best of 3: 16.6 µs per loop
In [390]: timeit t.test1b()
100000 loops, best of 3: 14.2 µs per loop
Related
Consider the array a:
a = np.array([3, 3, np.nan, 3, 3, np.nan])
I could do
np.isnan(a).argmax()
But this requires finding all np.nan just to find the first.
Is there a more efficient way?
I've been trying to figure out if I can pass a parameter to np.argpartition such that np.nan gets sorted first as opposed to last.
EDIT regarding [dup].
There are several reasons this question is different.
That question and its answers addressed equality of values. This one is about isnan.
Those answers all suffer from the same issue my answer faces. Note, I provided a perfectly valid answer but highlighted its inefficiency. I'm looking to fix the inefficiency.
EDIT regarding second [dup].
It still addresses equality, and the question/answers are old and very possibly outdated.
It might also be worth looking into numba.jit; without it, the vectorized version will likely beat a straightforward pure-Python search in most scenarios, but after compiling the code, the ordinary search takes the lead, at least in my testing:
In [63]: a = np.array([np.nan if i % 10000 == 9999 else 3 for i in range(100000)])
In [70]: %paste
import numpy as np
import numba

def naive(a):
    for i in range(len(a)):
        if np.isnan(a[i]):
            return i

def short(a):
    return np.isnan(a).argmax()

@numba.jit
def naive_jit(a):
    for i in range(len(a)):
        if np.isnan(a[i]):
            return i

@numba.jit
def short_jit(a):
    return np.isnan(a).argmax()
## -- End pasted text --
In [71]: %timeit naive(a)
100 loops, best of 3: 7.22 ms per loop
In [72]: %timeit short(a)
The slowest run took 4.59 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 37.7 µs per loop
In [73]: %timeit naive_jit(a)
The slowest run took 6821.16 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.79 µs per loop
In [74]: %timeit short_jit(a)
The slowest run took 395.51 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 144 µs per loop
Edit: As pointed out by @hpaulj in their answer, numpy actually ships with an optimized short-circuited search whose performance is comparable with the JITted search above:
In [26]: %paste
def plain(a):
    return a.argmax()

@numba.jit
def plain_jit(a):
    return a.argmax()
## -- End pasted text --
In [35]: %timeit naive(a)
100 loops, best of 3: 7.13 ms per loop
In [36]: %timeit plain(a)
The slowest run took 4.37 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.04 µs per loop
In [37]: %timeit naive_jit(a)
100000 loops, best of 3: 6.91 µs per loop
In [38]: %timeit plain_jit(a)
10000 loops, best of 3: 125 µs per loop
I'll nominate
a.argmax()
With @fuglede's test array:
In [1]: a = np.array([np.nan if i % 10000 == 9999 else 3 for i in range(100000)])
In [2]: np.isnan(a).argmax()
Out[2]: 9999
In [3]: np.argmax(a)
Out[3]: 9999
In [4]: a.argmax()
Out[4]: 9999
In [5]: timeit a.argmax()
The slowest run took 29.94 ....
10000 loops, best of 3: 20.3 µs per loop
In [6]: timeit np.isnan(a).argmax()
The slowest run took 7.82 ...
1000 loops, best of 3: 462 µs per loop
I don't have numba installed, so I can't compare that. But my speedup relative to short is greater than @fuglede's 6x.
I'm testing in Py3, which accepts comparisons such as < with np.nan, while Py2 raises a runtime warning. But the code search suggests this isn't dependent on that comparison.
In /numpy/core/src/multiarray/calculation.c, PyArray_ArgMax plays with axes (moving the one of interest to the end) and delegates the action to arg_func = PyArray_DESCR(ap)->f->argmax, a function that depends on the dtype.
In numpy/core/src/multiarray/arraytypes.c.src it looks like BOOL_argmax short-circuits, returning as soon as it encounters a True.
for (; i < n; i++) {
    if (ip[i]) {
        *max_ind = i;
        return 0;
    }
}
And @fname@_argmax also short-circuits on a maximal nan. np.nan is 'maximal' in argmin as well.
#if @isfloat@
    if (@isnan@(mp)) {
        /* nan encountered; it's maximal */
        return 0;
    }
#endif
Comments from experienced C coders are welcome, but it appears to me that, at least for np.nan, a plain argmax will be as fast as we can get.
Playing with the 9999 used in generating a shows that the a.argmax time depends on that value, consistent with short-circuiting.
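One way to check that yourself is to move the position of the first nan and time a.argmax() at each position; with short-circuiting, the time should grow roughly with that position. A minimal sketch (absolute numbers will vary by machine):

import numpy as np
from timeit import timeit

for pos in (99, 9999, 99999):
    a = np.full(100000, 3.0)
    a[pos] = np.nan
    print(pos, timeit('a.argmax()', globals={'a': a}, number=10000))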
Here is a pythonic approach using itertools.takewhile():
from itertools import takewhile
sum(1 for _ in takewhile(np.isfinite, a))
Benchmark against the generator_expression_within_next approach [1]:
In [118]: a = np.repeat(a, 10000)
In [120]: %timeit next(i for i, j in enumerate(a) if np.isnan(j))
100 loops, best of 3: 12.4 ms per loop
In [121]: %timeit sum(1 for _ in takewhile(np.isfinite, a))
100 loops, best of 3: 11.5 ms per loop
But still (by far) slower than numpy approach:
In [119]: %timeit np.isnan(a).argmax()
100000 loops, best of 3: 16.8 µs per loop
[1] The problem with this approach is the use of the enumerate function, which first builds an enumerate object (an iterator-like object) from the numpy array; calling the generator function and the iterator's next method takes time.
When looking for the first match, we could iterate through and exit on the first match rather than processing the entire array. So, we would have an approach using Python's next function, like so -
next((i for i, val in enumerate(a) if np.isnan(val)))
Sample runs -
In [192]: a = np.array([3, 3, np.nan, 3, 3, np.nan])
In [193]: next((i for i, val in enumerate(a) if np.isnan(val)))
Out[193]: 2
In [194]: a[2] = 10
In [195]: next((i for i, val in enumerate(a) if np.isnan(val)))
Out[195]: 5
I'm reading through the Python Docs, and, under Section 8.4.1,
I found the following __init__ definition (abbreviated):
class ListBasedSet(collections.abc.Set):
    ''' Alternate set implementation favoring space over speed
        and not requiring the set elements to be hashable. '''
    def __init__(self, iterable):
        self.elements = lst = []
        for value in iterable:
            if value not in lst:
                lst.append(value)
The part I don't get is the self.elements = lst = [] line. Why the double assignment?
Adding some print statements:
def __init__(self, iterable):
    self.elements = lst = []
    print('elements id:', id(self.elements))
    print('lst id:', id(lst))
    for value in iterable:
        if value not in lst:
            lst.append(value)
Declaring one:
ListBasedSet(range(3))
elements id: 4741984136
lst id: 4741984136
Out[36]: <__main__.ListBasedSet at 0x11ab12fd0>
As expected, they both point to the same PyObject.
Is brevity the only reason to do something like this? If not, why? Something to do with reentrancy?
I'd call this a case of premature optimization; you don't save that much by eliminating the dot, especially for large input iterables. Here are some timings:
Eliminating the dot:
%timeit ListBasedSet(range(3))
The slowest run took 4.06 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.05 µs per loop
%timeit ListBasedSet(range(30))
100000 loops, best of 3: 18.5 µs per loop
%timeit ListBasedSet(range(3000))
10 loops, best of 3: 119 ms per loop
While, with the dot (i.e. replacing lst with self.elements):
%timeit ListBasedSet(range(3))
The slowest run took 5.97 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.48 µs per loop
%timeit ListBasedSet(range(30))
10000 loops, best of 3: 22.8 µs per loop
%timeit ListBasedSet(range(3000))
10 loops, best of 3: 118 ms per loop
As you can see, as we increase the size of the input iterable, the difference in time pretty much disappears; the appending and membership testing swamp any gains.
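If you do want to chase that kind of micro-optimization, the more common form of the idiom hoists the bound append method as well, so the attribute lookup happens once rather than once per iteration. A minimal sketch (the same caveat applies: for large iterables the membership test dominates):

def __init__(self, iterable):
    self.elements = lst = []
    append = lst.append          # one attribute lookup, reused inside the loop
    for value in iterable:
        if value not in lst:
            append(value)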
I am looking for a more efficient way to do the equivalent of
myarray * (2**arange(len(myarray)))
Essentially I am after something like numpy.packbits that packs the bits into a single integer, for any reasonably sized myarray, yielding an appropriately sized integer. I can implement this using numpy.packbits, but I was wondering whether there is already a builtin that does this.
Three versions:
from numpy import *
from numba import jit

myarray = random.randint(0, 2, 64).astype(uint64)

def convert1(arr):
    return (arr * (2**arange(arr.size, dtype=uint64))).sum()

pow2 = 2**arange(64, dtype=uint64)

def convert2(arr):
    return (arr * pow2[:arr.size]).sum()

@jit("uint64(uint64[:])")
def convert3(arr):
    y = 0
    for i in range(arr.size):
        y = y + pow2[i] * arr[i]
    return y
with times:
In [44]: %timeit convert1(myarray)
10000 loops, best of 3: 62.7 µs per loop
In [45]: %timeit convert2(myarray)
10000 loops, best of 3: 11.6 µs per loop
In [46]: %timeit convert3(myarray)
1000000 loops, best of 3: 1.55 µs per loop
Precomputing and Numba allow big improvements.
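For completeness, the numpy.packbits route mentioned in the question also works. Here is a minimal sketch (pack_to_int is a hypothetical helper, not a NumPy builtin), assuming a 0/1 array whose length is a multiple of 8 and whose index 0 holds the least significant bit:

import numpy as np

def pack_to_int(bits):
    # packbits packs MSB-first within each byte, so reverse the bit order first
    packed = np.packbits(bits[::-1].astype(np.uint8))
    return int.from_bytes(packed.tobytes(), 'big')

bits = np.random.randint(0, 2, 64)
pow2 = 2**np.arange(64, dtype=np.uint64)
assert pack_to_int(bits) == int((bits.astype(np.uint64) * pow2).sum())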
Consider the following two functions, which essentially multiply every number in a small sequence by every number in a larger sequence to build up a 2D array, and then double all the values in the array. noloop() uses direct multiplication of 2D numpy arrays and returns the result, whereas loop() uses a for loop to iterate over arr1 and gradually build up an output array.
import numpy as np
arr1 = np.random.rand(100, 1)
arr2 = np.random.rand(1, 100000)
def noloop():
    return (arr1 * arr2) * 2

def loop():
    out = np.empty((arr1.size, arr2.size))
    for i in range(arr1.size):
        tmp = (arr1[i] * arr2) * 2
        out[i] = tmp.reshape(tmp.size)
    return out
I expected noloop to be much faster even for a small number of iterations, but for the array sizes above, loop is actually faster:
>>> %timeit noloop()
10 loops, best of 3: 64.7 ms per loop
>>> %timeit loop()
10 loops, best of 3: 41.6 ms per loop
And interestingly, if I remove *2 in both functions, noloop is faster, but only slightly:
>>> %timeit noloop()
10 loops, best of 3: 29.4 ms per loop
>>> %timeit loop()
10 loops, best of 3: 34.4 ms per loop
Is there a good explanation for these results, and is there a notably faster way to perform the same task?
I wasn't able to reproduce your results, but I did find that I could get a substantial speedup (a factor of 2) using numpy.multiply. By using the out argument you can take advantage of the fact that the memory is already allocated and eliminate the copying of tmp to out.
def out_loop():
    out = np.empty((arr1.size, arr2.size))
    for i in range(arr1.size):
        np.multiply(arr1[i], arr2, out=out[i].reshape((1, arr2.size)))
        out[i] *= 2
    return out
Results on my machine:
In [32]: %timeit out_loop()
100 loops, best of 3: 17.7 ms per loop
In [33]: %timeit loop()
10 loops, best of 3: 28.3 ms per loop
In the Python docs I can see that deque is a special collection highly optimized for popping/adding items from the left or right side. E.g., the documentation says:
Deques are a generalization of stacks and queues (the name is
pronounced “deck” and is short for “double-ended queue”). Deques
support thread-safe, memory efficient appends and pops from either
side of the deque with approximately the same O(1) performance in
either direction.
Though list objects support similar operations, they are optimized for
fast fixed-length operations and incur O(n) memory movement costs for
pop(0) and insert(0, v) operations which change both the size and
position of the underlying data representation.
I decided to make some comparisons using IPython. Could anyone explain to me what I did wrong here:
In [31]: %timeit range(1, 10000).pop(0)
10000 loops, best of 3: 114 us per loop
In [32]: %timeit deque(xrange(1, 10000)).pop()
10000 loops, best of 3: 181 us per loop
In [33]: %timeit deque(range(1, 10000)).pop()
1000 loops, best of 3: 243 us per loop
Could anyone explain to me what I did wrong here
Yes, your timing is dominated by the time to create the list or deque. The time to do the pop is insignificant in comparison.
Instead you should isolate the thing you're trying to test (the pop speed) from the setup time:
In [1]: from collections import deque
In [2]: s = list(range(1000))
In [3]: d = deque(s)
In [4]: s_append, s_pop = s.append, s.pop
In [5]: d_append, d_pop = d.append, d.pop
In [6]: %timeit s_pop(); s_append(None)
10000000 loops, best of 3: 115 ns per loop
In [7]: %timeit d_pop(); d_append(None)
10000000 loops, best of 3: 70.5 ns per loop
That said, the real differences between deques and lists in terms of performance are:
Deques have O(1) speed for appendleft() and popleft() while lists have O(n) performance for insert(0, value) and pop(0).
List append performance is hit and miss because it uses realloc() under the hood. As a result, it tends to have over-optimistic timings in simple code (because the realloc doesn't have to move data) and really slow timings in real code (because fragmentation forces realloc to move all the data). In contrast, deque append performance is consistent because it never reallocs and never moves data.
For what it is worth:
Python 3
deque.pop vs list.pop
> python3 -mtimeit -s 'import collections' -s 'items = range(10000000); base = [*items]' -s 'c = collections.deque(base)' 'c.pop()'
5000000 loops, best of 5: 46.5 nsec per loop
> python3 -mtimeit -s 'import collections' -s 'items = range(10000000); base = [*items]' 'base.pop()'
5000000 loops, best of 5: 55.1 nsec per loop
deque.appendleft vs list.insert
> python3 -mtimeit -s 'import collections' -s 'c = collections.deque()' 'c.appendleft(1)'
5000000 loops, best of 5: 52.1 nsec per loop
> python3 -mtimeit -s 'c = []' 'c.insert(0, 1)'
50000 loops, best of 5: 12.1 usec per loop
Python 2
> python -mtimeit -s 'import collections' -s 'c = collections.deque(xrange(1, 100000000))' 'c.pop()'
10000000 loops, best of 3: 0.11 usec per loop
> python -mtimeit -s 'c = range(1, 100000000)' 'c.pop()'
10000000 loops, best of 3: 0.174 usec per loop
> python -mtimeit -s 'import collections' -s 'c = collections.deque()' 'c.appendleft(1)'
10000000 loops, best of 3: 0.116 usec per loop
> python -mtimeit -s 'c = []' 'c.insert(0, 1)'
100000 loops, best of 3: 36.4 usec per loop
As you can see, where it really shines is in appendleft vs insert.
I would recommend referring to
https://wiki.python.org/moin/TimeComplexity
Python lists and deques have similar complexities for most operations (push, pop, etc.).
I found my way to this question and thought I'd offer up an example with a little context.
A classic use case for a deque is rotating/shifting elements in a collection because, as others have mentioned, you get very good (O(1)) complexity for push/pop operations on both ends; these operations just move references around, as opposed to a list, which has to physically shift its contents around in memory.
So here are 2 very similar-looking implementations of a rotate-left function:
def rotate_with_list(items, n):
    l = list(items)
    for _ in range(n):
        l.append(l.pop(0))
    return l

from collections import deque

def rotate_with_deque(items, n):
    d = deque(items)
    for _ in range(n):
        d.append(d.popleft())
    return d
Note: This is such a common use of a deque that the deque has a built-in rotate method, but I'm doing it manually here for the sake of visual comparison.
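For reference, the built-in method does the same thing in one call; a negative argument rotates left, matching rotate_with_deque above:

from collections import deque

d = deque(range(10))
d.rotate(-3)          # rotate left by 3
print(d)              # deque([3, 4, 5, 6, 7, 8, 9, 0, 1, 2])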
Now let's %timeit.
In [1]: def rotate_with_list(items, n):
   ...:     l = list(items)
   ...:     for _ in range(n):
   ...:         l.append(l.pop(0))
   ...:     return l
   ...:
   ...: from collections import deque
   ...: def rotate_with_deque(items, n):
   ...:     d = deque(items)
   ...:     for _ in range(n):
   ...:         d.append(d.popleft())
   ...:     return d
   ...:
In [2]: items = range(100000)
In [3]: %timeit rotate_with_list(items, 800)
100 loops, best of 3: 17.8 ms per loop
In [4]: %timeit rotate_with_deque(items, 800)
The slowest run took 5.89 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 527 µs per loop
In [5]: %timeit rotate_with_list(items, 8000)
10 loops, best of 3: 174 ms per loop
In [6]: %timeit rotate_with_deque(items, 8000)
The slowest run took 8.99 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.1 ms per loop
In [7]: more_items = range(10000000)
In [8]: %timeit rotate_with_list(more_items, 800)
1 loop, best of 3: 4.59 s per loop
In [9]: %timeit rotate_with_deque(more_items, 800)
10 loops, best of 3: 109 ms per loop
Pretty interesting how both data structures expose an eerily similar interface but have drastically different performance :)
Out of curiosity, I tried inserting at the beginning of a list vs. appendleft() on a deque. Clearly the deque is the winner.
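For anyone who wants to reproduce that last comparison, here is a minimal sketch (absolute numbers will vary by machine):

from timeit import timeit

print(timeit('lst.insert(0, 0)', setup='lst = []', number=100000))
print(timeit('d.appendleft(0)', setup='from collections import deque; d = deque()', number=100000))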