Related
So, without telling a really long story: I was working on some code where I was reading in some data from a binary file and then looping over every single point using a for loop. I completed the code and it was running ridiculously slowly. I was looping over around 60,000 points from around 128 data channels, and this was taking a minute or more to process. This was way slower than I ever expected Python to run. I eventually made the whole thing more efficient by using Numpy, but in trying to figure out why the original version ran so slowly we did some type checking and found that I was looping over Numpy arrays instead of Python lists. OK, no big deal; to make the inputs to our test setup the same, I converted the Numpy arrays to lists before looping. Bang: the same slow code that took a minute to run now took 10 seconds. I was floored. The only thing I did was change a Numpy array to a Python list; I changed it back and it was slow as mud again. I couldn't believe it, so I went to get more definitive proof:
$ python -m timeit -s "import numpy" "for k in numpy.arange(5000): k+1"
100 loops, best of 3: 5.46 msec per loop
$ python -m timeit "for k in range(5000): k+1"
1000 loops, best of 3: 256 usec per loop
What is going on? I know that Numpy arrays and Python lists are different, but why is it so much slower to iterate over every point in an array?
I observed this behavior in both Python 2.6 and 2.7, running Numpy 1.10.1, I believe.
We can do a little sleuthing to figure this out:
>>> import numpy as np
>>> a = np.arange(32)
>>> a
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])
>>> a.data
<read-write buffer for 0x107d01e40, size 256, offset 0 at 0x107d199b0>
>>> id(a.data)
4433424176
>>> id(a[0])
4424950096
>>> id(a[1])
4424950096
>>> for item in a:
... print id(item)
...
4424950096
4424950120
4424950096
4424950120
4424950096
4424950120
4424950096
4424950120
4424950096
4424950120
4424950096
4424950120
4424950096
4424950120
4424950096
4424950120
4424950096
4424950120
4424950096
4424950120
4424950096
4424950120
4424950096
4424950120
4424950096
4424950120
4424950096
4424950120
4424950096
4424950120
4424950096
4424950120
So what is going on here? First, I took a look at the memory location of the array's memory buffer. It's at 4433424176. That in itself isn't too illuminating. However, numpy stores its data as a contiguous C array, so the first element in the numpy array should correspond to the memory address of the array's buffer itself, but it doesn't:
>>> id(a[0])
4424950096
And it's a good thing it doesn't, because that would break the invariant in Python that two objects never have the same id during their lifetimes.
So, how does numpy accomplish this? Well, the answer is that numpy has to wrap the returned object in a Python type (e.g. numpy.float64 or, in this case, numpy.int64), which takes time if you're iterating item by item.[1] Further proof of this is demonstrated when iterating: we see that we're alternating between two separate IDs while iterating over the array. This means that Python's memory allocator and garbage collector are working overtime to create new objects and then free them.
A list doesn't have this memory allocator/garbage collector overhead. The objects in the list already exist as python objects (and they'll still exist after iteration), so neither plays any role in the iteration over a list.
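You can see this boxing directly. A minimal sketch (the exact scalar type depends on your platform's default integer size):
>>> import numpy as np
>>> a = np.arange(32)
>>> type(a[0])            # indexing hands back a freshly boxed numpy scalar
<type 'numpy.int64'>
>>> type(a.tolist()[0])   # tolist() converts everything to plain python ints up front
<type 'int'>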
Timing methodology:
Also note, your timings are thrown off a little bit by your assumptions. You were assuming that k + 1 should take about the same amount of time in both cases, but it doesn't. Notice if I repeat your timings without doing any addition:
mgilson$ python -m timeit -s "import numpy" "for k in numpy.arange(5000): k"
1000 loops, best of 3: 233 usec per loop
mgilson$ python -m timeit "for k in range(5000): k"
10000 loops, best of 3: 114 usec per loop
there's only about a factor of 2 difference. Doing the addition however leads to a factor of 5 difference or so:
mgilson$ python -m timeit "for k in range(5000): k+1"
10000 loops, best of 3: 179 usec per loop
mgilson$ python -m timeit -s "import numpy" "for k in numpy.arange(5000): k+1"
1000 loops, best of 3: 786 usec per loop
For fun, let's just do the addition:
$ python -m timeit -s "v = 1" "v + 1"
10000000 loops, best of 3: 0.0261 usec per loop
mgilson$ python -m timeit -s "import numpy; v = numpy.int64(1)" "v + 1"
10000000 loops, best of 3: 0.121 usec per loop
And finally, your timeit also includes list/array construction time which isn't ideal:
mgilson$ python -m timeit -s "v = range(5000)" "for k in v: k"
10000 loops, best of 3: 80.2 usec per loop
mgilson$ python -m timeit -s "import numpy; v = numpy.arange(5000)" "for k in v: k"
1000 loops, best of 3: 237 usec per loop
Notice that numpy actually got further away from the list solution in this case. This shows that iteration really is slower and you might get some speedups if you convert the numpy types to standard python types.
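For example, converting up front with tolist() and then iterating should land much closer to the plain-list timing. This is just a sketch of the command; the actual numbers will vary from machine to machine, so run it yourself:
$ python -m timeit -s "import numpy; v = numpy.arange(5000).tolist()" "for k in v: k"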
[1] Note: this doesn't take a lot of time when slicing, because slicing only has to allocate O(1) new objects since numpy returns a view into the original array.
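A small illustration of the view behaviour mentioned in that footnote (a sketch using a basic slice):
>>> import numpy as np
>>> a = np.arange(32)
>>> b = a[2:5]       # slicing allocates one new array object...
>>> b.base is a      # ...but it is only a view onto a's buffer
True
>>> b[0] = 99        # writing through the view changes the original
>>> a[2]
99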
Using python 2.7
Here are my speeds along with xrange:
python -m timeit -s "import numpy" "for k in numpy.arange(5000): k+1"
1000 loops, best of 3: 1.22 msec per loop
python -m timeit "for k in range(5000): k+1"
10000 loops, best of 3: 186 usec per loop
python -m timeit "for k in xrange(5000): k+1"
10000 loops, best of 3: 161 usec per loop
Numpy is noticeably slower because it's iterating over a numpy-specific array, which is not what it is primarily intended for. In many cases, numpy arrays should be treated as a monolithic collection of numbers rather than as simple lists/iterables. For example, if we have a rather large-ish Python list of numbers that we want to raise to the third power, we might do something like this:
python -m timeit "lst1 = [x for x in range(100000)];" "lst2 = map(lambda x: x**3, lst1)"
10 loops, best of 3: 125 msec per loop
Note: lst1 represents an arbitrary list. I'm aware you can speed this up by writing [x**3 for x in range(100000)] directly, but this is dealing with a list that should already exist and may very well not be sequential.
Anyway, numpy is meant to be treated as an array would be:
python -m timeit -s "import numpy" "lst1 = numpy.arange(100000)" "lst2 = lst1**2"
10000 loops, best of 3: 120 usec per loop
Say you had two lists of arbitrary values, each of which you want to multiply together. In vanilla python, you might do:
python -m timeit -s "lst1 = [x for x in xrange(0, 10000, 2)]" "lst2 = [x for x in xrange(2, 10002, 2)]" "lst3 = [x*y for x,y in zip(lst1, lst2)]"
1000 loops, best of 3: 736 usec per loop
And in Numpy:
python -m timeit -s "import numpy" "lst1 = numpy.arange(0, 10000, 2)" "lst2 = numpy.arange(2, 10002, 2)" "lst3 = lst1*lst2"
100000 loops, best of 3: 10.9 usec per loop
In these last two examples, NumPy skyrockets ahead as the clear winner. For simple iteration over a list, range or xrange is perfectly sufficient, but your example does not take into account the true purpose of Numpy arrays. It's like comparing planes and cars: yes, planes are generally faster for what they are intended to do, but trying to fly to your local supermarket is not prudent.
In the Python docs I can see that deque is a special collection, highly optimized for popping/adding items from the left or right side. E.g. the documentation says:
Deques are a generalization of stacks and queues (the name is
pronounced “deck” and is short for “double-ended queue”). Deques
support thread-safe, memory efficient appends and pops from either
side of the deque with approximately the same O(1) performance in
either direction.
Though list objects support similar operations, they are optimized for
fast fixed-length operations and incur O(n) memory movement costs for
pop(0) and insert(0, v) operations which change both the size and
position of the underlying data representation.
I decided to make some comparisons using ipython. Could anyone explain to me what I did wrong here:
In [31]: %timeit range(1, 10000).pop(0)
10000 loops, best of 3: 114 us per loop
In [32]: %timeit deque(xrange(1, 10000)).pop()
10000 loops, best of 3: 181 us per loop
In [33]: %timeit deque(range(1, 10000)).pop()
1000 loops, best of 3: 243 us per loop
Could anyone explain to me what I did wrong here
Yes, your timing is dominated by the time to create the list or deque. The time to do the pop is insignificant in comparison.
Instead you should isolate the thing you're trying to test (the pop speed) from the setup time:
In [1]: from collections import deque
In [2]: s = list(range(1000))
In [3]: d = deque(s)
In [4]: s_append, s_pop = s.append, s.pop
In [5]: d_append, d_pop = d.append, d.pop
In [6]: %timeit s_pop(); s_append(None)
10000000 loops, best of 3: 115 ns per loop
In [7]: %timeit d_pop(); d_append(None)
10000000 loops, best of 3: 70.5 ns per loop
That said, the real differences between deques and list in terms of performance are:
Deques have O(1) speed for appendleft() and popleft() while lists have O(n) performance for insert(0, value) and pop(0).
List append performance is hit and miss because it uses realloc() under the hood. As a result, it tends to have over-optimistic timings in simple code (because the realloc doesn't have to move data) and really slow timings in real code (because fragmentation forces realloc to move all the data). In contrast, deque append performance is consistent because it never reallocs and never moves data.
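You can observe the realloc-driven growth strategy indirectly with sys.getsizeof, which reports a list's currently allocated size. This is only a rough sketch (exact sizes differ across Python versions and platforms), but the point is that the size jumps in chunks rather than on every append:
import sys

lst, sizes = [], []
for i in range(64):
    lst.append(i)
    sizes.append(sys.getsizeof(lst))

# Only a handful of distinct allocation sizes show up for 64 appends,
# because most appends fit into the slack left by the previous realloc.
print(sorted(set(sizes)))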
For what it is worth:
Python 3
deque.pop vs list.pop
> python3 -mtimeit -s 'import collections' -s 'items = range(10000000); base = [*items]' -s 'c = collections.deque(base)' 'c.pop()'
5000000 loops, best of 5: 46.5 nsec per loop
> python3 -mtimeit -s 'import collections' -s 'items = range(10000000); base = [*items]' 'base.pop()'
5000000 loops, best of 5: 55.1 nsec per loop
deque.appendleft vs list.insert
> python3 -mtimeit -s 'import collections' -s 'c = collections.deque()' 'c.appendleft(1)'
5000000 loops, best of 5: 52.1 nsec per loop
> python3 -mtimeit -s 'c = []' 'c.insert(0, 1)'
50000 loops, best of 5: 12.1 usec per loop
Python 2
> python -mtimeit -s 'import collections' -s 'c = collections.deque(xrange(1, 100000000))' 'c.pop()'
10000000 loops, best of 3: 0.11 usec per loop
> python -mtimeit -s 'c = range(1, 100000000)' 'c.pop()'
10000000 loops, best of 3: 0.174 usec per loop
> python -mtimeit -s 'import collections' -s 'c = collections.deque()' 'c.appendleft(1)'
10000000 loops, best of 3: 0.116 usec per loop
> python -mtimeit -s 'c = []' 'c.insert(0, 1)'
100000 loops, best of 3: 36.4 usec per loop
As you can see, where it really shines is in appendleft vs insert.
I would recommend that you refer to https://wiki.python.org/moin/TimeComplexity.
Python lists and deques have similar complexities for most operations (push, pop, etc.).
I found my way to this question and thought I'd offer up an example with a little context.
A classic use-case for a deque is rotating/shifting elements in a collection, because (as others have mentioned) you get very good (O(1)) complexity for push/pop operations on both ends; these operations just move references around, whereas a list has to shift its underlying contents over in memory whenever you pop or insert at the front.
So here are 2 very similar-looking implementations of a rotate-left function:
def rotate_with_list(items, n):
    l = list(items)
    for _ in range(n):
        l.append(l.pop(0))
    return l

from collections import deque

def rotate_with_deque(items, n):
    d = deque(items)
    for _ in range(n):
        d.append(d.popleft())
    return d
Note: This is such a common use of a deque that the deque has a built-in rotate method, but I'm doing it manually here for the sake of visual comparison.
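For reference, the built-in method does the same left rotation when given a negative count; a quick sketch (deque.rotate(n) with positive n rotates to the right):
from collections import deque

d = deque([1, 2, 3, 4, 5])
d.rotate(-2)              # rotate left by 2
print(d)                  # deque([3, 4, 5, 1, 2])

m = deque([1, 2, 3, 4, 5])
for _ in range(2):        # the manual version from above gives the same result
    m.append(m.popleft())
print(m)                  # deque([3, 4, 5, 1, 2])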
Now let's %timeit.
In [1]: def rotate_with_list(items, n):
   ...:     l = list(items)
   ...:     for _ in range(n):
   ...:         l.append(l.pop(0))
   ...:     return l
   ...:
   ...: from collections import deque
   ...: def rotate_with_deque(items, n):
   ...:     d = deque(items)
   ...:     for _ in range(n):
   ...:         d.append(d.popleft())
   ...:     return d
   ...:
In [2]: items = range(100000)
In [3]: %timeit rotate_with_list(items, 800)
100 loops, best of 3: 17.8 ms per loop
In [4]: %timeit rotate_with_deque(items, 800)
The slowest run took 5.89 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 527 µs per loop
In [5]: %timeit rotate_with_list(items, 8000)
10 loops, best of 3: 174 ms per loop
In [6]: %timeit rotate_with_deque(items, 8000)
The slowest run took 8.99 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.1 ms per loop
In [7]: more_items = range(10000000)
In [8]: %timeit rotate_with_list(more_items, 800)
1 loop, best of 3: 4.59 s per loop
In [9]: %timeit rotate_with_deque(more_items, 800)
10 loops, best of 3: 109 ms per loop
Pretty interesting how both data structures expose an eerily similar interface but have drastically different performance :)
Out of curiosity, I tried inserting at the beginning of a list vs. appendleft() on a deque.
Clearly the deque is the winner.
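Here is a minimal sketch of that comparison using the timeit module; I won't quote numbers, since they depend on the machine, but list.insert(0, ...) should degrade as the list grows while deque.appendleft stays roughly flat:
import timeit

# Each statement pushes 100,000 items onto the front of an initially empty container.
print(timeit.timeit('lst.insert(0, None)', setup='lst = []', number=100000))
print(timeit.timeit('d.appendleft(None)',
                    setup='from collections import deque; d = deque()',
                    number=100000))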
Is it possible to return two lists from a list comprehension? Well, this obviously doesn't work, but something like:
rr, tt = [i*10, i*12 for i in xrange(4)]
So rr and tt both are lists with the results from i*10 and i*12 respectively.
Many thanks
>>> rr,tt = zip(*[(i*10, i*12) for i in xrange(4)])
>>> rr
(0, 10, 20, 30)
>>> tt
(0, 12, 24, 36)
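If you need real lists rather than the tuples that zip returns, one option (just a sketch of the same idea) is to map list over the result:
>>> rr, tt = map(list, zip(*[(i*10, i*12) for i in xrange(4)]))
>>> rr
[0, 10, 20, 30]
>>> tt
[0, 12, 24, 36]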
Creating two list comprehensions is better (at least for long lists). Be aware that the top-voted answer can be even slower than traditional for loops. List comprehensions are faster and clearer:
python -m timeit -n 100 -s 'rr=[];tt = [];' 'for i in range(500000): rr.append(i*10);tt.append(i*12)'
10 loops, best of 3: 123 msec per loop
> python -m timeit -n 100 'rr,tt = zip(*[(i*10, i*12) for i in range(500000)])'
10 loops, best of 3: 170 msec per loop
> python -m timeit -n 100 'rr = [i*10 for i in range(500000)]; tt = [i*12 for i in range(500000)]'
10 loops, best of 3: 68.5 msec per loop
It would be nice to see list comprehensions supporting the creation of multiple lists at a time.
However, if you can take advantage of using a traditional loop (to be precise, of intermediate calculations), then it is possible that you will be better off with a loop (or an iterator/generator using yield). Here is an example:
$ python3 -m timeit -n 100 -s 'rr=[];tt=[];' "for i in (range(1000) for x in range(10000)): tmp = list(i); rr.append(min(tmp));tt.append(max(tmp))"
100 loops, best of 3: 314 msec per loop
$ python3 -m timeit -n 100 "rr=[min(list(i)) for i in (range(1000) for x in range(10000))];tt=[max(list(i)) for i in (range(1000) for x in range(10000))]"
100 loops, best of 3: 413 msec per loop
Of course, the comparison in these cases is unfair; in the example, the code and calculations are not equivalent, because in the traditional loop a temporary result is stored (see the tmp variable). So the list comprehension version is doing many more internal operations (it calculates the tmp list twice!), yet it is only about 30% slower.
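For completeness, here is a sketch of the generator/yield variant mentioned above; it keeps the single-pass tmp idea while still letting you unpack into two sequences:
def min_max(iterables):
    for it in iterables:
        tmp = list(it)            # materialise once, like the tmp variable in the loop
        yield min(tmp), max(tmp)

rr, tt = zip(*min_max(range(1000) for x in range(10000)))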
It is possible for a list comprehension to return multiple lists if the elements are lists.
So for example:
>>> x, y = [[] for x in range(2)]
>>> x
[]
>>> y
[]
>>>
The trick with the zip function does the job, but it is actually much simpler and more readable to just collect the results in lists with a loop, as sketched below.
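For example, a trivial sketch of that loop:
rr, tt = [], []
for i in xrange(4):
    rr.append(i * 10)
    tt.append(i * 12)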
I would like to split a string into two-character chunks in a single call. I'm looking for a simple syntax using a list comprehension, but I haven't got it yet:
s = "123456"
And the result would be:
["12", "34", "56"]
What i don't want:
re.split('(?i)([0-9a-f]{2})', s)
s[0:2], s[2:4], s[4:6]
[s[i*2:i*2+2] for i in range(len(s) / 2)]
Edit:
OK, I wanted to parse a hex RGB[A] color (and possibly other color/component formats) to extract all of the components.
It seems that the fastest approach is the last one from sven-marnach:
sven-marnach xrange: 0.883 usec per loop
python -m timeit -s 's="aabbcc";' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'
pair/iter: 1.38 usec per loop
python -m timeit -s 's="aabbcc"' '["%c%c" % pair for pair in zip(* 2 * [iter(s)])]'
Regex: 2.55 usec per loop
python -m timeit -s 'import re; s="aabbcc"; c=re.compile("(?i)([0-9a-f]{2})");
split=re.split' '[int(x, 16) / 255. for x in split(c, s) if x != ""]'
Reading through the comments, it turns out the actual question is: what is the fastest way to parse a color definition string in hexadecimal RRGGBBAA format? Here are some options:
import struct

def rgba1(s, unpack=struct.unpack):
    return unpack("BBBB", s.decode("hex"))

def rgba2(s, int=int, xrange=xrange):
    return [int(s[i:i+2], 16) for i in xrange(0, 8, 2)]

def rgba3(s, int=int, xrange=xrange):
    x = int(s, 16)
    return [(x >> i) & 255 for i in xrange(0, 32, 8)]
As I expected, the first version turns out to be fastest:
In [6]: timeit rgba1("aabbccdd")
1000000 loops, best of 3: 1.44 us per loop
In [7]: timeit rgba2("aabbccdd")
100000 loops, best of 3: 2.43 us per loop
In [8]: timeit rgba3("aabbccdd")
100000 loops, best of 3: 2.44 us per loop
In [4]: ["".join(pair) for pair in zip(* 2 * [iter(s)])]
Out[4]: ['aa', 'bb', 'cc']
See: How does zip(*[iter(s)]*n) work in Python? for explanations as to that strange "2-iter over the same str" syntax.
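In case it helps, a tiny illustration of why that works: zip pulls from the same iterator for both of its arguments, so consecutive characters get paired up.
>>> s = "aabbcc"
>>> it = iter(s)                              # one iterator shared by both zip arguments
>>> ["".join(pair) for pair in zip(it, it)]
['aa', 'bb', 'cc']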
You say in the comments that you want to "have the fastest execution"; I can't promise you that with this implementation, but you can measure the execution using timeit. Remember what Donald Knuth said about premature optimisation, of course. For the problem at hand (now that you've revealed it) I think you'd find r, g, b = s[0:2], s[2:4], s[4:6] hard to beat.
$ python3.2 -m timeit -c '
s = "aabbcc"
["".join(pair) for pair in zip(* 2 * [iter(s)])]
'
100000 loops, best of 3: 4.49 usec per loop
Cf.
python3.2 -m timeit -c '
s = "aabbcc"
r, g, b = s[0:2], s[2:4], s[4:6]
'
1000000 loops, best of 3: 1.2 usec per loop
Numpy is worse than your preferred solution for a single lookup:
$ python -m timeit -s 'import numpy as np; s="aabbccdd"' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; list(a)'
100000 loops, best of 3: 5.14 usec per loop
$ python -m timeit -s 's="aabbcc";' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'
100000 loops, best of 3: 2.41 usec per loop
But if you do several conversions at once, numpy is much faster:
$ python -m timeit -s 'import numpy as np; s="aabbccdd" * 100' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; a.tolist()'
10000 loops, best of 3: 59.6 usec per loop
$ python -m timeit -s 's="aabbccdd" * 100;' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'
1000 loops, best of 3: 240 usec per loop
Numpy is faster for batches larger than 2, on my computer. You can easily group the values by setting a.shape to (number_of_colors, 4), though this makes the tolist method 50% slower.
In fact, most of the time is spent converting the array to a list. Depending on what you wish to do with the results, you may be able to skip this intermediate step and reap some benefits:
$ python -m timeit -s 'import numpy as np; s="aabbccdd" * 100' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; a.shape = (100,4)'
100000 loops, best of 3: 6.76 usec per loop
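For instance, if the downstream code can work with numpy values directly, the reshaped array can be indexed without ever calling tolist. This is only a sketch, following the Python 2 string handling used above:
import numpy as np

s = "aabbccdd" * 100
a = np.fromstring(s.decode("hex"), dtype="uint8")   # Python 2 hex decoding, as above
a.shape = (100, 4)                                  # one row per RGBA colour
r, g, b, alpha = a[0]                               # components of the first colour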
If I had a list of integers, say,
x = [1,2,3,4,5]
Is there an in-built function that can convert this into a single number like 12345? If not, what's the easiest way?
>>> listvar = [1,2,3,4,5]
>>> reduce(lambda x,y:x*10+y, listvar, 0)
12345
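One caveat, in case you are on Python 3, where reduce is no longer a builtin; the same approach becomes:
>>> from functools import reduce
>>> listvar = [1, 2, 3, 4, 5]
>>> reduce(lambda acc, d: acc * 10 + d, listvar, 0)
12345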
If they're digits like this,
sum(digit * 10 ** place for place, digit in enumerate(reversed(x)))
int("".join(str(X) for X in x))
You have not told us what the result for x = [1, 23, 4] should be, by the way...
My answer gives 1234; others give 334.
Just for fun :)
int(str(x)[1:-1].replace(', ', ''))
Surprisingly, this is even faster for a large list:
$ python -m timeit -s "x=[1,2,3,4,5,6,7,8,9,0]*100" "int(str(x)[1:-1].replace(', ', ''))"
10000 loops, best of 3: 128 usec per loop
$ python -m timeit -s "x=[1,2,3,4,5,6,7,8,9,0]*100" "int(''.join(map(str, x)))"
10000 loops, best of 3: 183 usec per loop
$ python -m timeit -s "x=[1,2,3,4,5,6,7,8,9,0]*100" "reduce(lambda x,y:x*10+y, x, 0)"
1000 loops, best of 3: 649 usec per loop
$ python -m timeit -s "x=[1,2,3,4,5,6,7,8,9,0]*100" "sum(digit * 10 ** place for place, digit in enumerate(reversed(x)))"
100 loops, best of 3: 7.19 msec per loop
But for a very small list (maybe the more common case?), this one is the slowest.