This question is regarding the solution to the (first half of) Day 5 problem for Advent of Code 2020.
I have written two different functions that arrive at the same result, i.e. decoding the boarding pass string into a (row, column) coordinate. In the first one I do a binary search driven by each character in the string:
def decode(bp):
row = bp[:7]
col = bp[-3:]
row_lower, row_upper = 0, 127
col_lower, col_upper = 0, 7
for char in row:
if char=='F':
row_upper = ((row_upper+row_lower))//2
else:
row_lower = ((row_upper+row_lower)+1)//2
for char in col:
if char=='L':
col_upper = ((col_upper+col_lower))//2
else:
col_lower = ((col_upper+col_lower)+1)//2
sid = (row_lower*8)+col_lower
return (row_lower, col_lower, sid)
I then realized that if we consider the string as a binary number, there is a 1:1 map to each row, column coordinate, and so I also wrote this alternative solution to the problem:
def alternative_decode(bp):
bp = bp.replace('F', '0').replace('L', '0').replace('B', '1').replace('R', '1')
return(int(bp[:7], 2), int(bp[-3:], 2), (int(bp[:7], 2)*8)+int(bp[-3:], 2))
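For example, 'FFBBFBF' reads as the binary number 0011010 (i.e. 26) and 'LLR' as 001 (i.e. 1), so:
>>> alternative_decode("FFBBFBFLLR")
(26, 1, 209)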
I wrote the second solution because I expected it to be considerably faster than the first, given that it's a simple binary conversion rather than a binary search. However, when timing the two approaches, I noticed that both have the same running time, i.e. roughly between 0.0000020 and 0.0000025 seconds.
What is the cause of this? Is there some Python magic happening behind the scenes that makes both solutions equally efficient, or did I write them in such a way that makes them equally inefficient?
On my computer, your alternative becomes significantly faster if you reuse the row/col result, rather than repeating the int() calls for the sid return:
def alternative_decode_reuse(bp):
bp = bp.replace('F', '0').replace('L', '0').replace('B', '1').replace('R', '1')
row = int(bp[:7], 2)
col = int(bp[-3:], 2)
sid = row*8 + col
return (row, col, sid)
After putting all three functions in a file decoding.py, this gives us:
tmp$ python -m timeit -s 'import decoding' -- 'decoding.decode("FFBBFBFLLR")'
1000000 loops, best of 3: 1.75 usec per loop
tmp$ python -m timeit -s 'import decoding' -- 'decoding.alternative_decode("FFBBFBFLLR")'
1000000 loops, best of 3: 1.62 usec per loop
tmp$ python -m timeit -s 'import decoding' -- 'decoding.alternative_decode_reuse("FFBBFBFLLR")'
1000000 loops, best of 3: 1.32 usec per loop
At this point, most of the time is spent doing the replace() line. So what if we didn't?
def third_decode(bp):
row = 0
for c in bp[:7]:
row <<= 1
if c == 'B':
row += 1
col = 0
for c in bp[7:]:
col <<= 1
if c == 'R':
col += 1
sid = row * 8 + col
return (row, col, sid)
This gives:
tmp$ python -m timeit -s 'import decoding' -- 'decoding.third_decode("FFBBFBFLLR")'
1000000 loops, best of 3: 1.37 usec per loop
Slightly worse, or at least not clearly better. What if we also use the fact that the desired sid is equivalent to a (binary) concatenation of the row/col numbers?
def fourth_decode(bp):
sid = 0
for c in bp:
sid <<= 1
if c in 'BR':
sid += 1
row = sid >> 3
col = sid & 7
return (row, col, sid)
Yes, that helps a little:
tmp$ python -m timeit -s 'import decoding' -- 'decoding.fourth_decode("FFBBFBFLLR")'
1000000 loops, best of 3: 1.16 usec per loop
At this point, I'm tired of editing cmdline args to re-run everything, so let's add this to the bottom of decoding.py:
if __name__ == '__main__':
loops = 1000000
funcs = (
decode, alternative_decode, alternative_decode_reuse, replace_only,
third_decode, fourth_decode,
)
from timeit import Timer
for fun in funcs:
cmd = 'decoding.%s("FFBBFBFLLR")' % fun.__name__
timer = Timer(cmd, setup='import decoding')
totaltime = min(timer.repeat(5, loops))
fmt = '%25s returned %14r -- %8d loops, best of 5: %6d ns per loop'
arg = (fun.__name__, fun('FFBBFBFLLR'), loops, totaltime*1000000000/loops)
print(fmt % arg)
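One note: the funcs tuple also lists a replace_only helper not shown above; it performs just the replace() step and returns nothing (hence the None in the output below), roughly:
def replace_only(bp):
    # only the string translation, to see how much of the total it accounts for
    bp.replace('F', '0').replace('L', '0').replace('B', '1').replace('R', '1')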
This lets us run all the functions without hassle:
tmp$ python decoding.py
decode returned (26, 1, 209) -- 1000000 loops, best of 5: 2090 ns per loop
alternative_decode returned (26, 1, 209) -- 1000000 loops, best of 5: 1829 ns per loop
alternative_decode_reuse returned (26, 1, 209) -- 1000000 loops, best of 5: 1414 ns per loop
replace_only returned None -- 1000000 loops, best of 5: 700 ns per loop
third_decode returned (26, 1, 209) -- 1000000 loops, best of 5: 1368 ns per loop
fourth_decode returned (26, 1, 209) -- 1000000 loops, best of 5: 1123 ns per loop
After that, I'm out of ideas of how to make it go faster. But last year, an Advent of Code-solving acquaintance told me he's using the pypy Python implementation for its speed. Maybe it can help?
tmp$ pypy decoding.py
decode returned (26, 1, 209) -- 1000000 loops, best of 5: 151 ns per loop
alternative_decode returned (26, 1, 209) -- 1000000 loops, best of 5: 4 ns per loop
alternative_decode_reuse returned (26, 1, 209) -- 1000000 loops, best of 5: 3 ns per loop
replace_only returned None -- 1000000 loops, best of 5: 3 ns per loop
third_decode returned (26, 1, 209) -- 1000000 loops, best of 5: 141 ns per loop
fourth_decode returned (26, 1, 209) -- 1000000 loops, best of 5: 138 ns per loop
Well, all my effort for nothing! :)
It looks like pypy's replace() and int() functions are much faster. Also, while its JIT does make our various loopy functions faster, it's still preferable to use the builtin functions when possible.
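To illustrate that last point, here is one more variant that leans entirely on builtins: a single translate() plus one int() parse, splitting row and col with shifts. It is not included in the timings above, so consider it a sketch; hoisting the translation table out of the function would be the obvious next tweak.
def fifth_decode(bp):
    # translate F/L -> 0 and B/R -> 1, parse the whole pass as one 10-bit number,
    # then split it into row (high 7 bits) and col (low 3 bits)
    table = str.maketrans('FLBR', '0011')
    sid = int(bp.translate(table), 2)
    return (sid >> 3, sid & 7, sid)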
A 256*64 pixel OLED display connected to a Raspberry Pi (Zero W) takes 4-bit greyscale pixel data packed into bytes (i.e. two pixels per byte), so 8192 bytes in total. E.g. the bytes
0a 0b 0c 0d (only lower nibble has data)
become
ab cd
Converting these bytes, obtained either from a Pillow (PIL) Image or a cairo ImageSurface, takes up to 0.9 s when naively iterating the pixel data, depending on color depth.
Combining every two bytes from a Pillow "L" (monochrome 8 bit) Image:
imd = im.tobytes()
nibbles = [int(p / 16) for p in imd]
packed = []
msn = None
for n in nibbles:
nib = n & 0x0F
if msn is not None:
b = msn << 4 | nib
packed.append(b)
msn = None
else:
msn = nib
This (dropping the state variable and avoiding the float/integer conversion) brings it down to about half (0.2 s):
packed = []
for b in range(0, 256*64, 2):
packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
Basically the first approach applied to an RGB24 (32 bit!) cairo ImageSurface, though with a crude greyscale conversion:
mv = surface.get_data()
w = surface.get_width()
h = surface.get_height()
f = surface.get_format()
s = surface.get_stride()
print(len(mv), w, h, f, s)
# convert xRGB
o = []
msn = None
for p in range(0, len(mv), 4):
nib = int( (mv[p+1] + mv[p+2] + mv[p+3]) / 3 / 16) & 0x0F
if msn is not None:
b = msn << 4 | nib
o.append(b)
msn = None
else:
msn = nib
takes about twice as long (0.9 s vs 0.4 s).
The struct module does not support nibbles (half-bytes).
bitstring does allow packing nibbles:
>>> a = bitstring.BitStream()
>>> a.insert('0xf')
>>> a.insert('0x1')
>>> a
BitStream('0xf1')
>>> a.insert(5)
>>> a
BitStream('0b1111000100000')
>>> a.insert('0x2')
>>> a
BitStream('0b11110001000000010')
>>>
But there does not seem to be a method to unpack this into a list of integers quickly -- the following takes 30 seconds:
a = bitstring.BitStream()
for p in imd:
a.append( bitstring.Bits(uint=p//16, length=4) )
packed=[]
a.pos=0
for p in range(256*64//2):
packed.append( a.read(8).uint )
Does Python 3 have the means to do this efficiently, or do I need an alternative?
An external packer wrapped with ctypes? The same, but simpler, with Cython (I have not yet looked into these)? Update: Cython looks very good, see my answer below.
Down to 130 ms from 200 ms by just wrapping the loop in a function
def packer0(imd):
"""same loop in a def"""
packed = []
for b in range(0, 256*64, 2):
packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
return packed
Down to 35 ms by Cythonizing the same code
def packer1(imd):
"""Cythonize python nibble packing loop"""
packed = []
for b in range(0, 256*64, 2):
packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
return packed
Down to 16 ms with a typed loop variable
def packer2(imd):
"""Cythonize python nibble packing loop, typed"""
packed = []
cdef unsigned int b
for b in range(0, 256*64, 2):
packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
return packed
Not much of a difference with a "simplified" loop
def packer3(imd):
"""Cythonize python nibble packing loop, typed"""
packed = []
cdef unsigned int i
for i in range(256*64/2):
packed.append( (imd[i*2]//16)<<4 | (imd[i*2+1]//16) )
return packed
Maybe a tiny bit faster even (15 ms)
def packer4(it):
"""Cythonize python nibble packing loop, typed"""
cdef unsigned int n = len(it)//2
cdef unsigned int i
return [ (it[i*2]//16)<<4 | it[i*2+1]//16 for i in range(n) ]
Here it is with timeit:
>>> timeit.timeit('packer4(data)', setup='from pack import packer4; data = [0]*256*64', number=100)
1.31725951000044
>>> exit()
pi@raspberrypi:~ $ python3 -m timeit -s 'from pack import packer4; data = [0]*256*64' 'packer4(data)'
100 loops, best of 3: 9.04 msec per loop
This already meets my requirements, but I guess there may be further optimization possible with the input/output iterables (-> unsigned int array?) or accessing the input data with a wider data type (Raspbian is 32 bit, BCM2835 is ARM1176JZF-S single-core).
Or with parallelism on the GPU or the multi-core Raspberry Pis.
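For what it's worth, a vectorized NumPy version would avoid the Python-level loop entirely. This is an untimed sketch (pack_numpy is just an illustrative name, not something I benchmarked on the Pi):
import numpy as np
def pack_numpy(imd):
    # keep the high nibble of every byte, then merge pixel pairs:
    # even pixels become the high nibble, odd pixels the low nibble
    a = np.frombuffer(imd, dtype=np.uint8) >> 4
    return ((a[0::2] << 4) | a[1::2]).tobytes()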
A crude comparison with the same loop in C (ideone):
#include <stdio.h>
#include <stdint.h>
#define SIZE (256*64)
int main(void) {
uint8_t in[SIZE] = {0};
uint8_t out[SIZE/2] = {0};
uint8_t t;
for(t=0; t<100; t++){
uint16_t i;
for(i=0; i<SIZE/2; i++){
out[i] = (in[i*2]/16)<<4 | in[i*2+1]/16;
}
}
return 0;
}
It's apparently 100 times faster:
pi@raspberry:~ $ gcc p.c
pi@raspberry:~ $ time ./a.out
real 0m0.085s
user 0m0.060s
sys 0m0.010s
Eliminating the shifts/division may be another slight optimization (I have not checked the resulting C, nor the binary):
def packs(bytes it):
"""Cythonize python nibble packing loop, typed"""
cdef unsigned int n = len(it)//2
cdef unsigned int i
return [ ( (it[i<<1]&0xF0) | (it[(i<<1)+1]>>4) ) for i in range(n) ]
results in
python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 12.7 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 12 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 11 msec per loop
python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 13.9 msec per loop
With numpy arrays, I want to perform this operation:
move x[1],...,x[n-1] to x[0],...,x[n-2] (left shift),
write a new value in the last index: x[n-1] = newvalue.
This is similar to a pop(), push(newvalue) for a first-in last-out queue (only inverted).
A naive implementation is: x[:-1] = x[1:]; x[-1] = newvalue.
Another implementation, using np.concatenate, is slower: np.concatenate((x[1:], np.array(newvalue).reshape(1,)), axis=0).
Is there a faster way to do it?
After some experiments, it is clear that:
copying is required,
and the fastest and simplest way to do that, for an ndarray (numpy array), is slicing and copying.
So the solution is: x[:-1] = x[1:]; x[-1] = newvalue.
Here is a small benchmark:
>>> x = np.random.randint(0, 1e6, 10**8); newvalue = -100
>>> %timeit x[:-1] = x[1:]; x[-1] = newvalue
1000 loops, best of 3: 73.6 ms per loop
>>> %timeit np.concatenate((x[1:], np.array(newvalue).reshape(1,)), axis=0)
1 loop, best of 3: 339 ms per loop
But if you don't need fast access to all values in the array, only to the first or last ones, using a deque is smarter.
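For example, with a bounded deque the oldest value falls off automatically:
from collections import deque
x = deque([1, 2, 3, 4], maxlen=4)
x.append(5)    # drops the leftmost value (1) automatically
print(x)       # deque([2, 3, 4, 5], maxlen=4)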
I know I'm late and this question has been satisfactorily answered, but I was just facing something similar for recording a buffer of streaming data.
You mentioned "first-in last-out" which is a stack, but your example demonstrates a queue, so I will share a solution for a queue that does not require copying to enqueue new items. (You will eventually need to do one copy using numpy.roll to pass the final array to another function.)
You can use a circular array with a pointer that tracks where the tail is (the place you will be adding new items to the queue).
If you start with this array:
x[0], x[1], x[2], x[3], x[4], x[5]
                               /\
                              tail
and you want to drop x[0] and add x[6], you can do this using the originally allocated memory for the array, without the need for a copy:
x[6], x[1], x[2], x[3], x[4], x[5]
 /\
tail
and so on...
x[6], x[7], x[2], x[3], x[4], x[5]
       /\
      tail
Each time you enqueue you move the tail one spot to the right. You can use modulus to make this wrap nicely: new_tail = (old_tail + 1) % length.
Finding the head of the queue is always one spot after the tail. This can be found using the same formula: head = (tail + 1) % length.
            head
             \/
x[6], x[7], x[2], x[3], x[4], x[5]
       /\
      tail
Here is an example of the class I created for this circular buffer/array:
# benchmark_circular_buffer.py
import numpy as np
# all operations are O(1) and don't require copying the array
# except to_array which has to copy the array and is O(n)
class RecordingQueue1D:
def __init__(self, object: object, maxlen: int):
#allocate the memory we need ahead of time
self.max_length: int = maxlen
self.queue_tail: int = maxlen - 1
o_len = len(object)
if (o_len == maxlen):
self.rec_queue = np.array(object, dtype=np.int64)
elif (o_len > maxlen):
self.rec_queue = np.array(object[o_len-maxlen:], dtype=np.int64)
else:
self.rec_queue = np.append(np.array(object, dtype=np.int64), np.zeros(maxlen-o_len, dtype=np.int64))
self.queue_tail = o_len - 1
def to_array(self) -> np.array:
head = (self.queue_tail + 1) % self.max_length
return np.roll(self.rec_queue, -head) # this will force a copy
def enqueue(self, new_data: np.array) -> None:
# move tail pointer forward then insert at the tail of the queue
# to enforce max length of recording
self.queue_tail = (self.queue_tail + 1) % self.max_length
self.rec_queue[self.queue_tail] = new_data
def peek(self) -> int:
queue_head = (self.queue_tail + 1) % self.max_length
return self.rec_queue[queue_head]
def replace_item_at(self, index: int, new_value: int):
loc = (self.queue_tail + 1 + index) % self.max_length
        self.rec_queue[loc] = new_value
def item_at(self, index: int) -> int:
# the item we want will be at head + index
loc = (self.queue_tail + 1 + index) % self.max_length
return self.rec_queue[loc]
def __repr__(self):
return "tail: " + str(self.queue_tail) + "\narray: " + str(self.rec_queue)
def __str__(self):
return "tail: " + str(self.queue_tail) + "\narray: " + str(self.rec_queue)
# return str(self.to_array())
rnd_arr = np.random.randint(0, 1e6, 10**8)
new_val = -100
slice_arr = rnd_arr.copy()
c_buf_arr = RecordingQueue1D(rnd_arr.copy(), len(rnd_arr))
# Test speed for queuing new a new item
# swapping items 100 and 1000
# swapping items 10000 and 100000
def slice_and_copy():
slice_arr[:-1] = slice_arr[1:]
slice_arr[-1] = new_val
old = slice_arr[100]
slice_arr[100] = slice_arr[1000]
old = slice_arr[10000]
slice_arr[10000] = slice_arr[100000]
def circular_buffer():
c_buf_arr.enqueue(new_val)
old = c_buf_arr.item_at(100)
    c_buf_arr.replace_item_at(100, c_buf_arr.item_at(1000))
    old = c_buf_arr.item_at(10000)
    c_buf_arr.replace_item_at(10000, c_buf_arr.item_at(100000))
# lets add copying the array to a new numpy.array
# this will take O(N) time for the circular buffer because we use numpy.roll()
# which copies the array.
def slice_and_copy_assignemnt():
slice_and_copy()
my_throwaway_arr = slice_arr.copy()
return my_throwaway_arr
def circular_buffer_assignment():
circular_buffer()
my_throwaway_arr = c_buf_arr.to_array().copy()
return my_throwaway_arr
# test using
# python -m timeit -s "import benchmark_circular_buffer as bcb" "bcb.slice_and_copy()"
# python -m timeit -s "import benchmark_circular_buffer as bcb" "bcb.circular_buffer()"
# python -m timeit -r 5 -n 4 -s "import benchmark_circular_buffer as bcb" "bcb.slice_and_copy_assignemnt()"
# python -m timeit -r 5 -n 4 -s "import benchmark_circular_buffer as bcb" "bcb.circular_buffer_assignment()"
When you have to enqueue a lot of items without needing to hand off a copy of the array, this is a couple of orders of magnitude faster than slicing.
Accessing items and replacing items is O(1). Enqueue and peek are both O(1). Copying the array takes O(n) time.
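For illustration, a small usage example (the values are arbitrary):
q = RecordingQueue1D([1, 2, 3], maxlen=3)
q.enqueue(4)           # overwrites the oldest item (1) in place
print(q.peek())        # 2 -- the current head of the queue
print(q.item_at(2))    # 4 -- the newest item
print(q.to_array())    # [2 3 4] -- copied out in logical order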
Benchmarking Results:
(thermal_venv) PS X:\win10\repos\thermal> python -m timeit -s "import benchmark_circular_buffer as bcb" "bcb.slice_and_copy()"
10 loops, best of 5: 36.7 msec per loop
(thermal_venv) PS X:\win10\repos\thermal> python -m timeit -s "import benchmark_circular_buffer as bcb" "bcb.circular_buffer()"
200000 loops, best of 5: 1.04 usec per loop
(thermal_venv) PS X:\win10\repos\thermal> python -m timeit -s "import benchmark_circular_buffer as bcb" "bcb.slice_and_copy_assignemnt()"
2 loops, best of 5: 166 msec per loop
(thermal_venv) PS X:\win10\repos\thermal> python -m timeit -r 5 -n 4 -s "import benchmark_circular_buffer as bcb" "bcb.slice_and_copy_assignemnt()"
4 loops, best of 5: 159 msec per loop
(thermal_venv) PS X:\win10\repos\thermal> python -m timeit -r 5 -n 4 -s "import benchmark_circular_buffer as bcb" "bcb.circular_buffer_assignment()"
4 loops, best of 5: 511 msec per loop
There is a test script and an implementation that handles 2D arrays on my GitHub here
I was playing around with timeit and noticed that doing a simple list comprehension over a small string took longer than doing the same operation on a list of small single-character strings -- almost 1.35 times as long. Any explanation?
>>> from timeit import timeit
>>> timeit("[x for x in 'abc']")
2.0691067844831528
>>> timeit("[x for x in ['a', 'b', 'c']]")
1.5286479570345861
What's happening on a lower level that's causing this?
TL;DR
The actual speed difference is closer to 70% (or more) once a lot of the overhead is removed, for Python 2.
Object creation is not at fault. Neither method creates a new object, as one-character strings are cached.
The difference is unobvious, but is likely due to a greater number of checks on string indexing, with regard to the type and well-formedness. It is also quite likely due to the need to check what to return.
List indexing is remarkably fast.
>>> python3 -m timeit '[x for x in "abc"]'
1000000 loops, best of 3: 0.388 usec per loop
>>> python3 -m timeit '[x for x in ["a", "b", "c"]]'
1000000 loops, best of 3: 0.436 usec per loop
This disagrees with what you've found...
You must be using Python 2, then.
>>> python2 -m timeit '[x for x in "abc"]'
1000000 loops, best of 3: 0.309 usec per loop
>>> python2 -m timeit '[x for x in ["a", "b", "c"]]'
1000000 loops, best of 3: 0.212 usec per loop
Let's explain the difference between the versions. I'll examine the compiled code.
For Python 3:
import dis
def list_iterate():
[item for item in ["a", "b", "c"]]
dis.dis(list_iterate)
#>>> 4 0 LOAD_CONST 1 (<code object <listcomp> at 0x7f4d06b118a0, file "", line 4>)
#>>> 3 LOAD_CONST 2 ('list_iterate.<locals>.<listcomp>')
#>>> 6 MAKE_FUNCTION 0
#>>> 9 LOAD_CONST 3 ('a')
#>>> 12 LOAD_CONST 4 ('b')
#>>> 15 LOAD_CONST 5 ('c')
#>>> 18 BUILD_LIST 3
#>>> 21 GET_ITER
#>>> 22 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
#>>> 25 POP_TOP
#>>> 26 LOAD_CONST 0 (None)
#>>> 29 RETURN_VALUE
def string_iterate():
[item for item in "abc"]
dis.dis(string_iterate)
#>>> 21 0 LOAD_CONST 1 (<code object <listcomp> at 0x7f4d06b17150, file "", line 21>)
#>>> 3 LOAD_CONST 2 ('string_iterate.<locals>.<listcomp>')
#>>> 6 MAKE_FUNCTION 0
#>>> 9 LOAD_CONST 3 ('abc')
#>>> 12 GET_ITER
#>>> 13 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
#>>> 16 POP_TOP
#>>> 17 LOAD_CONST 0 (None)
#>>> 20 RETURN_VALUE
You see here that the list variant is likely to be slower due to the building of the list each time.
This is the
9 LOAD_CONST 3 ('a')
12 LOAD_CONST 4 ('b')
15 LOAD_CONST 5 ('c')
18 BUILD_LIST 3
part. The string variant only has
9 LOAD_CONST 3 ('abc')
You can check that this does seem to make a difference:
def string_iterate():
[item for item in ("a", "b", "c")]
dis.dis(string_iterate)
#>>> 35 0 LOAD_CONST 1 (<code object <listcomp> at 0x7f4d068be660, file "", line 35>)
#>>> 3 LOAD_CONST 2 ('string_iterate.<locals>.<listcomp>')
#>>> 6 MAKE_FUNCTION 0
#>>> 9 LOAD_CONST 6 (('a', 'b', 'c'))
#>>> 12 GET_ITER
#>>> 13 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
#>>> 16 POP_TOP
#>>> 17 LOAD_CONST 0 (None)
#>>> 20 RETURN_VALUE
This produces just
9 LOAD_CONST 6 (('a', 'b', 'c'))
as tuples are immutable. Test:
>>> python3 -m timeit '[x for x in ("a", "b", "c")]'
1000000 loops, best of 3: 0.369 usec per loop
Great, back up to speed.
For Python 2:
def list_iterate():
[item for item in ["a", "b", "c"]]
dis.dis(list_iterate)
#>>> 2 0 BUILD_LIST 0
#>>> 3 LOAD_CONST 1 ('a')
#>>> 6 LOAD_CONST 2 ('b')
#>>> 9 LOAD_CONST 3 ('c')
#>>> 12 BUILD_LIST 3
#>>> 15 GET_ITER
#>>> >> 16 FOR_ITER 12 (to 31)
#>>> 19 STORE_FAST 0 (item)
#>>> 22 LOAD_FAST 0 (item)
#>>> 25 LIST_APPEND 2
#>>> 28 JUMP_ABSOLUTE 16
#>>> >> 31 POP_TOP
#>>> 32 LOAD_CONST 0 (None)
#>>> 35 RETURN_VALUE
def string_iterate():
[item for item in "abc"]
dis.dis(string_iterate)
#>>> 2 0 BUILD_LIST 0
#>>> 3 LOAD_CONST 1 ('abc')
#>>> 6 GET_ITER
#>>> >> 7 FOR_ITER 12 (to 22)
#>>> 10 STORE_FAST 0 (item)
#>>> 13 LOAD_FAST 0 (item)
#>>> 16 LIST_APPEND 2
#>>> 19 JUMP_ABSOLUTE 7
#>>> >> 22 POP_TOP
#>>> 23 LOAD_CONST 0 (None)
#>>> 26 RETURN_VALUE
The odd thing is that we have the same building of the list, but it's still faster for this. Python 2 is acting strangely fast.
Let's remove the comprehensions and re-time. The _ = is to prevent it getting optimised out.
>>> python3 -m timeit '_ = ["a", "b", "c"]'
10000000 loops, best of 3: 0.0707 usec per loop
>>> python3 -m timeit '_ = "abc"'
100000000 loops, best of 3: 0.0171 usec per loop
We can see that initialization is not significant enough to account for the difference between the versions (those numbers are small)! We can thus conclude that Python 3 has slower comprehensions. This makes sense as Python 3 changed comprehensions to have safer scoping.
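(For reference, the scoping change means the loop variable no longer leaks out of the comprehension:)
>>> python2 -c 'l = [x for x in "abc"]; print(x)'
c
>>> python3 -c 'l = [x for x in "abc"]; print(x)'
NameError: name 'x' is not defined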
Well, now let's improve the benchmark (I'm just removing overhead that isn't iteration). This removes the building of the iterable by pre-assigning it:
>>> python3 -m timeit -s 'iterable = "abc"' '[x for x in iterable]'
1000000 loops, best of 3: 0.387 usec per loop
>>> python3 -m timeit -s 'iterable = ["a", "b", "c"]' '[x for x in iterable]'
1000000 loops, best of 3: 0.368 usec per loop
>>> python2 -m timeit -s 'iterable = "abc"' '[x for x in iterable]'
1000000 loops, best of 3: 0.309 usec per loop
>>> python2 -m timeit -s 'iterable = ["a", "b", "c"]' '[x for x in iterable]'
10000000 loops, best of 3: 0.164 usec per loop
We can check if calling iter is the overhead:
>>> python3 -m timeit -s 'iterable = "abc"' 'iter(iterable)'
10000000 loops, best of 3: 0.099 usec per loop
>>> python3 -m timeit -s 'iterable = ["a", "b", "c"]' 'iter(iterable)'
10000000 loops, best of 3: 0.1 usec per loop
>>> python2 -m timeit -s 'iterable = "abc"' 'iter(iterable)'
10000000 loops, best of 3: 0.0913 usec per loop
>>> python2 -m timeit -s 'iterable = ["a", "b", "c"]' 'iter(iterable)'
10000000 loops, best of 3: 0.0854 usec per loop
No. No it is not. The difference is too small, especially for Python 3.
So let's remove yet more unwanted overhead... by making the whole thing slower! The aim is just to have a longer iteration so the time hides overhead.
>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' '[x for x in iterable]'
100 loops, best of 3: 3.12 msec per loop
>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' '[x for x in iterable]'
100 loops, best of 3: 2.77 msec per loop
>>> python2 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' '[x for x in iterable]'
100 loops, best of 3: 2.32 msec per loop
>>> python2 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' '[x for x in iterable]'
100 loops, best of 3: 2.09 msec per loop
This hasn't actually changed much, but it's helped a little.
So remove the comprehension. It's overhead that's not part of the question:
>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'for x in iterable: pass'
1000 loops, best of 3: 1.71 msec per loop
>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'for x in iterable: pass'
1000 loops, best of 3: 1.36 msec per loop
>>> python2 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'for x in iterable: pass'
1000 loops, best of 3: 1.27 msec per loop
>>> python2 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'for x in iterable: pass'
1000 loops, best of 3: 935 usec per loop
That's more like it! We can get slightly faster still by using deque to iterate. It's basically the same, but it's faster:
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 777 usec per loop
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 405 usec per loop
>>> python2 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 805 usec per loop
>>> python2 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 438 usec per loop
What impresses me is that Unicode is competitive with bytestrings. We can check this explicitly by trying bytes and unicode in both:
bytes
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = b"".join(chr(random.randint(0, 127)).encode("ascii") for _ in range(100000))' 'deque(iterable, maxlen=0)' :(
1000 loops, best of 3: 571 usec per loop
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)).encode("ascii") for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 394 usec per loop
>>> python2 -m timeit -s 'import random; from collections import deque; iterable = b"".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 757 usec per loop
>>> python2 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 438 usec per loop
Here you see Python 3 actually faster than Python 2.
unicode
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = u"".join( chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 800 usec per loop
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = [ chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 394 usec per loop
>>> python2 -m timeit -s 'import random; from collections import deque; iterable = u"".join(unichr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 1.07 msec per loop
>>> python2 -m timeit -s 'import random; from collections import deque; iterable = [unichr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 469 usec per loop
Again, Python 3 is faster, although this is to be expected (str has had a lot of attention in Python 3).
In fact, this unicode-bytes difference is very small, which is impressive.
So let's analyse this one case, seeing as it's fast and convenient for me:
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 777 usec per loop
>>> python3 -m timeit -s 'import random; from collections import deque; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'deque(iterable, maxlen=0)'
1000 loops, best of 3: 405 usec per loop
We can actually rule out Tim Peters' 10-times-upvoted answer!
>>> foo = iterable[123]
>>> iterable[36] is foo
True
These are not new objects!
But this is worth mentioning: indexing costs. The difference will likely be in the indexing, so remove the iteration and just index:
>>> python3 -m timeit -s 'import random; iterable = "".join(chr(random.randint(0, 127)) for _ in range(100000))' 'iterable[123]'
10000000 loops, best of 3: 0.0397 usec per loop
>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'iterable[123]'
10000000 loops, best of 3: 0.0374 usec per loop
The difference seems small, but at least half of the cost is overhead:
>>> python3 -m timeit -s 'import random; iterable = [chr(random.randint(0, 127)) for _ in range(100000)]' 'iterable; 123'
100000000 loops, best of 3: 0.0173 usec per loop
so the speed difference is sufficient to decide to blame it. I think.
So why is indexing a list so much faster?
Well, I'll come back to you on that, but my guess is that it's down to the check for interned strings (or cached characters if it's a separate mechanism). This will be less fast than optimal. But I'll go check the source (although I'm not comfortable in C...) :).
So here's the source:
static PyObject *
unicode_getitem(PyObject *self, Py_ssize_t index)
{
void *data;
enum PyUnicode_Kind kind;
Py_UCS4 ch;
PyObject *res;
if (!PyUnicode_Check(self) || PyUnicode_READY(self) == -1) {
PyErr_BadArgument();
return NULL;
}
if (index < 0 || index >= PyUnicode_GET_LENGTH(self)) {
PyErr_SetString(PyExc_IndexError, "string index out of range");
return NULL;
}
kind = PyUnicode_KIND(self);
data = PyUnicode_DATA(self);
ch = PyUnicode_READ(kind, data, index);
if (ch < 256)
return get_latin1_char(ch);
res = PyUnicode_New(1, ch);
if (res == NULL)
return NULL;
kind = PyUnicode_KIND(res);
data = PyUnicode_DATA(res);
PyUnicode_WRITE(kind, data, 0, ch);
assert(_PyUnicode_CheckConsistency(res, 1));
return res;
}
Walking from the top, we'll have some checks. These are boring. Then some assigns, which should also be boring. The first interesting line is
ch = PyUnicode_READ(kind, data, index);
but we'd hope that is fast, as we're reading from a contiguous C array by indexing it. The result, ch, will be less than 256 so we'll return the cached character in get_latin1_char(ch).
So we'll run (dropping the first checks)
kind = PyUnicode_KIND(self);
data = PyUnicode_DATA(self);
ch = PyUnicode_READ(kind, data, index);
return get_latin1_char(ch);
Where
#define PyUnicode_KIND(op) \
(assert(PyUnicode_Check(op)), \
assert(PyUnicode_IS_READY(op)), \
((PyASCIIObject *)(op))->state.kind)
(which is boring because asserts get ignored in debug [so I can check that they're fast] and ((PyASCIIObject *)(op))->state.kind) is (I think) an indirection and a C-level cast);
#define PyUnicode_DATA(op) \
(assert(PyUnicode_Check(op)), \
PyUnicode_IS_COMPACT(op) ? _PyUnicode_COMPACT_DATA(op) : \
_PyUnicode_NONCOMPACT_DATA(op))
(which is also boring for similar reasons, assuming the macros (Something_CAPITALIZED) are all fast),
#define PyUnicode_READ(kind, data, index) \
((Py_UCS4) \
((kind) == PyUnicode_1BYTE_KIND ? \
((const Py_UCS1 *)(data))[(index)] : \
((kind) == PyUnicode_2BYTE_KIND ? \
((const Py_UCS2 *)(data))[(index)] : \
((const Py_UCS4 *)(data))[(index)] \
) \
))
(which involves indexes but really isn't slow at all) and
static PyObject*
get_latin1_char(unsigned char ch)
{
PyObject *unicode = unicode_latin1[ch];
if (!unicode) {
unicode = PyUnicode_New(1, ch);
if (!unicode)
return NULL;
PyUnicode_1BYTE_DATA(unicode)[0] = ch;
assert(_PyUnicode_CheckConsistency(unicode, 1));
unicode_latin1[ch] = unicode;
}
Py_INCREF(unicode);
return unicode;
}
Which confirms my suspicion that:
This is cached:
PyObject *unicode = unicode_latin1[ch];
This should be fast. The if (!unicode) is not run, so it's literally equivalent in this case to
PyObject *unicode = unicode_latin1[ch];
Py_INCREF(unicode);
return unicode;
Honestly, after testing the asserts are fast (by disabling them [I think it works on the C-level asserts...]), the only plausibly-slow parts are:
PyUnicode_IS_COMPACT(op)
_PyUnicode_COMPACT_DATA(op)
_PyUnicode_NONCOMPACT_DATA(op)
Which are:
#define PyUnicode_IS_COMPACT(op) \
(((PyASCIIObject*)(op))->state.compact)
(fast, as before),
#define _PyUnicode_COMPACT_DATA(op) \
(PyUnicode_IS_ASCII(op) ? \
((void*)((PyASCIIObject*)(op) + 1)) : \
((void*)((PyCompactUnicodeObject*)(op) + 1)))
(fast if the macro IS_ASCII is fast), and
#define _PyUnicode_NONCOMPACT_DATA(op) \
(assert(((PyUnicodeObject*)(op))->data.any), \
((((PyUnicodeObject *)(op))->data.any)))
(also fast as it's an assert plus an indirection plus a cast).
So we're down (the rabbit hole) to:
PyUnicode_IS_ASCII
which is
#define PyUnicode_IS_ASCII(op) \
(assert(PyUnicode_Check(op)), \
assert(PyUnicode_IS_READY(op)), \
((PyASCIIObject*)op)->state.ascii)
Hmm... that seems fast too...
Well, OK, but let's compare it to PyList_GetItem. (Yeah, thanks Tim Peters for giving me more work to do :P.)
PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
if (!PyList_Check(op)) {
PyErr_BadInternalCall();
return NULL;
}
if (i < 0 || i >= Py_SIZE(op)) {
if (indexerr == NULL) {
indexerr = PyUnicode_FromString(
"list index out of range");
if (indexerr == NULL)
return NULL;
}
PyErr_SetObject(PyExc_IndexError, indexerr);
return NULL;
}
return ((PyListObject *)op) -> ob_item[i];
}
We can see that on non-error cases this is just going to run:
PyList_Check(op)
Py_SIZE(op)
((PyListObject *)op) -> ob_item[i]
Where PyList_Check is
#define PyList_Check(op) \
PyType_FastSubclass(Py_TYPE(op), Py_TPFLAGS_LIST_SUBCLASS)
(TABS! TABS!!!) (issue21587) That got fixed and merged in 5 minutes. Like... yeah. Damn. They put Skeet to shame.
#define Py_SIZE(ob) (((PyVarObject*)(ob))->ob_size)
#define PyType_FastSubclass(t,f) PyType_HasFeature(t,f)
#ifdef Py_LIMITED_API
#define PyType_HasFeature(t,f) ((PyType_GetFlags(t) & (f)) != 0)
#else
#define PyType_HasFeature(t,f) (((t)->tp_flags & (f)) != 0)
#endif
So this is normally really trivial (two indirections and a couple of boolean checks) unless Py_LIMITED_API is on, in which case... ???
Then there's the indexing and a cast (((PyListObject *)op) -> ob_item[i]) and we're done.
So there are definitely fewer checks for lists, and the small speed differences certainly imply that it could be relevant.
I think in general, there's just more type-checking and indirection (->) for Unicode. It seems I'm missing a point, but what?
When you iterate over most container objects (lists, tuples, dicts, ...), the iterator delivers the objects in the container.
But when you iterate over a string, a new object has to be created for each character delivered - a string is not "a container" in the same sense a list is a container. The individual characters in a string don't exist as distinct objects before iteration creates those objects.
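(You can see this directly, at least for characters outside the cached latin-1 range; the identity checks below are a CPython implementation detail:)
>>> x, y = list("aa")              # latin-1 characters come from a shared cache
>>> x is y
True
>>> a, b = list("\u1234\u1234")    # characters above that range are created fresh
>>> a is b
False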
You could be incurring an overhead for creating the iterator for the string, whereas the array already contains an iterator upon instantiation.
EDIT:
>>> timeit("[x for x in ['a','b','c']]")
0.3818681240081787
>>> timeit("[x for x in 'abc']")
0.3732869625091553
This was run using 2.7, but on my MacBook Pro i7. This could be the result of a system configuration difference.
I wish to compute a simple checksum: just adding the values of all bytes.
The quickest way I found is:
checksum = sum([ord(c) for c in buf])
But for a 13 MB buf, it takes 4.4 s: too long (in C, it takes 0.5 s).
If I use:
checksum = zlib.adler32(buf) & 0xffffffff
it takes 0.8 s, but the result is not the one I want.
So my question is: is there any function, or lib, or C extension to include in Python 2.6, to compute a simple checksum?
Thanks in advance,
Eric.
You could use sum(bytearray(buf)):
In [1]: buf = b'a'*(13*(1<<20))
In [2]: %timeit sum(ord(c) for c in buf)
1 loops, best of 3: 1.25 s per loop
In [3]: %timeit sum(imap(ord, buf))
1 loops, best of 3: 564 ms per loop
In [4]: %timeit b=bytearray(buf); sum(b)
10 loops, best of 3: 101 ms per loop
Here's a C extension for Python written in Cython, sumbytes.pyx file:
from libc.limits cimport ULLONG_MAX, UCHAR_MAX
def sumbytes(bytes buf not None):
cdef:
unsigned long long total = 0
unsigned char c
if len(buf) > (ULLONG_MAX // <size_t>UCHAR_MAX):
raise NotImplementedError #todo: implement for > 8 PiB available memory
for c in buf:
total += c
return total
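A quick sanity check of the extension (compiled e.g. via pyximport, as in the script below):
>>> from sumbytes import sumbytes
>>> sumbytes(b"abc")   # 97 + 98 + 99
294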
sumbytes is ~10 times faster than the bytearray variant:
name time ratio
sumbytes_sumbytes 12 msec 1.00
sumbytes_numpy 29.6 msec 2.48
sumbytes_bytearray 122 msec 10.19
To reproduce the time measurements, download reporttime.py and run:
#!/usr/bin/env python
# compile on-the-fly
import pyximport; pyximport.install() # pip install cython
import numpy as np
from reporttime import get_functions_with_prefix, measure
from sumbytes import sumbytes # from sumbytes.pyx
def sumbytes_sumbytes(input):
return sumbytes(input)
def sumbytes_bytearray(input):
return sum(bytearray(input))
def sumbytes_numpy(input):
return np.frombuffer(input, 'uint8').sum() # #root's answer
def main():
funcs = get_functions_with_prefix('sumbytes_')
buf = ''.join(map(unichr, range(256))).encode('latin1') * (1 << 16)
measure(funcs, args=[buf])
main()
Use numpy.frombuffer(buf, "uint8").sum(); it seems to be about 70 times faster than your example:
In [9]: import numpy as np
In [10]: buf = b'a'*(13*(1<<20))
In [11]: sum(bytearray(buf))
Out[11]: 1322254336
In [12]: %timeit sum(bytearray(buf))
1 loops, best of 3: 253 ms per loop
In [13]: np.frombuffer(buf, "uint8").sum()
Out[13]: 1322254336
In [14]: %timeit np.frombuffer(buf, "uint8").sum()
10 loops, best of 3: 36.7 ms per loop
In [15]: %timeit sum([ord(c) for c in buf])
1 loops, best of 3: 2.65 s per loop