Why are Python's arrays slow?

I expected array.array to be faster than lists, as arrays seem to be unboxed.
However, I get the following result:
In [1]: import array
In [2]: L = list(range(100000000))
In [3]: A = array.array('l', range(100000000))
In [4]: %timeit sum(L)
1 loop, best of 3: 667 ms per loop
In [5]: %timeit sum(A)
1 loop, best of 3: 1.41 s per loop
In [6]: %timeit sum(L)
1 loop, best of 3: 627 ms per loop
In [7]: %timeit sum(A)
1 loop, best of 3: 1.39 s per loop
What could be the cause of such a difference?

The storage is "unboxed", but every time you access an element Python has to "box" it (embed it in a regular Python object) in order to do anything with it. For example, your sum(A) iterates over the array, and boxes each integer, one at a time, in a regular Python int object. That costs time. In your sum(L), all the boxing was done at the time the list was created.
So, in the end, an array is generally slower, but requires substantially less memory.
Here's the relevant code from a recent version of Python 3, but the same basic ideas apply to all CPython implementations since Python was first released.
Here's the code to access a list item:
PyObject *
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
    /* error checking omitted */
    return ((PyListObject *)op)->ob_item[i];
}
There's very little to it: somelist[i] just returns the i'th object in the list (and all Python objects in CPython are pointers to a struct whose initial segment conforms to the layout of a struct PyObject).
And here's the __getitem__ implementation for an array with type code l:
static PyObject *
l_getitem(arrayobject *ap, Py_ssize_t i)
{
    return PyLong_FromLong(((long *)ap->ob_item)[i]);
}
The raw memory is treated as a vector of platform-native C long integers; the i'th C long is read up; and then PyLong_FromLong() is called to wrap ("box") the native C long in a Python long object (which, since Python 3 eliminated Python 2's distinction between int and long, is displayed as type int).
This boxing has to allocate new memory for a Python int object, and spray the native C long's bits into it. In the context of the original example, this object's lifetime is very brief (just long enough for sum() to add the contents into a running total), and then more time is required to deallocate the new int object.
This is where the speed difference comes from, always has come from, and always will come from in the CPython implementation.
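In CPython you can observe this boxing directly from Python: each array access produces a fresh int object, while the list hands back the object it already stores. A minimal check:
import array
A = array.array('l', [10**6] * 3)
L = [10**6] * 3
print(A[0] is A[0])  # False: every access boxes a brand-new int object
print(L[0] is L[0])  # True: the list returns the same already-boxed object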

To add to Tim Peters' excellent answer, arrays implement the buffer protocol, while lists do not. This means that if you are writing a C extension (or the moral equivalent, such as a Cython module), you can access and work with the elements of an array much faster than anything Python can do. This can give you considerable speed improvements, possibly well over an order of magnitude. However, it has a number of downsides (a quick pure-Python check of the buffer protocol is sketched after this list):
You are now in the business of writing C instead of Python. Cython is one way to ameliorate this, but it does not eliminate many fundamental differences between the languages; you need to be familiar with C semantics and understand what it is doing.
PyPy's C API works to some extent, but isn't very fast. If you are targeting PyPy, you should probably just write simple code with regular lists, and then let the JITter optimize it for you.
C extensions are harder to distribute than pure Python code because they need to be compiled. Compilation tends to be architecture and operating-system dependent, so you will need to ensure you are compiling for your target platform.
Going straight to C extensions may be using a sledgehammer to swat a fly, depending on your use case. You should first investigate NumPy and see if it is powerful enough to do whatever math you're trying to do. It will also be much faster than native Python, if used correctly.
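One way to see the buffer protocol from pure Python, without writing any C: memoryview() accepts an array but rejects a list. A minimal sketch:
import array
A = array.array('l', range(10))
m = memoryview(A)            # arrays expose their raw buffer
print(m[3], m.itemsize)      # 3, and the size in bytes of one native item
try:
    memoryview([0, 1, 2])    # lists hold boxed objects, not a flat buffer
except TypeError as e:
    print(e)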

Tim Peters answered why this is slow, but let's see how to improve it.
Sticking to your example of sum(range(...)) (a factor of 10 smaller than your example, to fit into memory here):
import numpy
import array
L = list(range(10**7))
A = array.array('l', L)
N = numpy.array(L)
%timeit sum(L)
10 loops, best of 3: 101 ms per loop
%timeit sum(A)
1 loop, best of 3: 237 ms per loop
%timeit sum(N)
1 loop, best of 3: 743 ms per loop
Used this way, numpy also needs to box/unbox each element, which adds overhead. To make it fast, one has to stay within numpy's C code:
%timeit N.sum()
100 loops, best of 3: 6.27 ms per loop
So from the list solution to the numpy version, this is a factor of 16 in runtime.
Let's also check how long creating those data structures takes:
%timeit list(range(10**7))
1 loop, best of 3: 283 ms per loop
%timeit array.array('l', range(10**7))
1 loop, best of 3: 884 ms per loop
%timeit numpy.array(range(10**7))
1 loop, best of 3: 1.49 s per loop
%timeit numpy.arange(10**7)
10 loops, best of 3: 21.7 ms per loop
Clear winner: Numpy
Also note that creating the data structure takes about as much time as summing, if not more. Allocating memory is slow.
Memory usage of those:
sys.getsizeof(L)
90000112
sys.getsizeof(A)
81940352
sys.getsizeof(N)
80000096
So these take 8 bytes per number, with varying overhead. For the range we use, 32-bit ints are sufficient, so we can save some memory:
N=numpy.arange(10**7, dtype=numpy.int32)
sys.getsizeof(N)
40000096
%timeit N.sum()
100 loops, best of 3: 8.35 ms per loop
But it turns out that adding 64-bit ints is faster than 32-bit ints on my machine, so this is only worth it if you are limited by memory/bandwidth.
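If you want the raw buffer size without the Python object header that sys.getsizeof includes, numpy arrays expose .nbytes (a quick check; the default integer dtype is assumed to be 64-bit, as on most 64-bit builds):
import numpy
N64 = numpy.arange(10**7)                      # int64 by default on most 64-bit builds
N32 = numpy.arange(10**7, dtype=numpy.int32)
print(N64.nbytes, N32.nbytes)                  # 80000000 vs 40000000 bytes of raw data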

I noticed that typecode L is faster than l, and the same holds for I and Q.
Python 3.8.5.
Here is the code of the test:
#!/usr/bin/python3
import inspect
import time

from tqdm import tqdm
from array import array


def get_var_name(var):
    """
    Gets the name of var. Searches from the outermost frame inwards.
    :param var: variable to get name from.
    :return: string
    """
    for fi in reversed(inspect.stack()):
        names = [var_name for var_name, var_val in fi.frame.f_locals.items() if var_val is var]
        if len(names) > 0:
            return names[0]


def performtest(func, n, *args, **kwargs):
    times = array('f')
    times_append = times.append
    for i in tqdm(range(n)):
        st = time.time()
        func(*args, **kwargs)
        times_append(time.time() - st)
    print(
        f"Func {func.__name__} with {[get_var_name(i) for i in args]} run {n} rounds consuming |"
        f" Mean: {sum(times)/len(times)}s | Max: {max(times)}s | Min: {min(times)}s"
    )


def list_int(start, end, step=1):
    return [i for i in range(start, end, step)]


def list_float(start, end, step=1):
    return [i + 1e-1 for i in range(start, end, step)]


def array_int(start, end, step=1):
    return array("I", range(start, end, step))  # speed I > i, H > h, Q > q, I~=H~=Q


def array_float(start, end, step=1):
    return array("f", [i + 1e-1 for i in range(start, end, step)])  # speed f > d


if __name__ == "__main__":
    performtest(list_int, 1000, 0, 10000)
    performtest(array_int, 1000, 0, 10000)
    performtest(list_float, 1000, 0, 10000)
    performtest(array_float, 1000, 0, 10000)
Results
[Screenshot of the test output in the original post.]
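For a quicker, self-contained check of the typecode claim, a minimal timeit sketch (typecodes and sizes chosen for illustration; absolute numbers will vary by machine):
import timeit

setup = "from array import array; r = range(10**6)"
for tc in ("l", "L", "i", "I", "q", "Q"):
    t = timeit.timeit(f"sum(array({tc!r}, r))", setup=setup, number=5)
    print(tc, f"{t:.3f} s")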

Please note that 100000000 equals 10^8, not 10^7. My results are the following:
100000000 == 10**8
# my test results on a Linux virtual machine:
#<L = list(range(100000000))> Time: 0:00:03.263585
#<A = array.array('l', range(100000000))> Time: 0:00:16.728709
#<L = list(range(10**8))> Time: 0:00:03.119379
#<A = array.array('l', range(10**8))> Time: 0:00:18.042187
#<A = array.array('l', L)> Time: 0:00:07.524478
#<sum(L)> Time: 0:00:01.640671
#<np.sum(L)> Time: 0:00:20.762153
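The slow np.sum(L) line is expected: numpy first has to convert the Python list to an array on every call. Converting once up front avoids that cost (a sketch; timings will differ):
import numpy as np
L = list(range(10**7))
N = np.asarray(L)    # pay the list-to-array conversion once
%timeit np.sum(L)    # re-converts the list on every run
%timeit N.sum()      # pure C loop over the existing buffer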

Related

Is it possible to improve python performance for this code?

I have a simple piece of code that:
Reads a trajectory file that can be seen as a list of 2D arrays (positions in space), stored in Y
I then want to compute the RMSD for each pair (scipy.pdist style)
My code works fine:
trajectory = read("test.lammpstrj", index="::")
m = len(trajectory)
#.get_positions() return a 2d numpy array
Y = np.array([snapshot.get_positions() for snapshot in trajectory])
b = [np.sqrt(((((Y[i]- Y[j])**2))*3).mean()) for i in range(m) for j in range(i + 1, m)]
This code executes in 0.86 seconds using Python 3.10; the same kind of code executes in 0.46 seconds using Julia 1.8.
I plan to have much larger trajectories (~200,000 elements). Would it be possible to get a speed-up using Python, or should I stick to Julia?
You've mentioned that snapshot.get_positions() returns some 2D array, suppose of shape (p, q). So I expect that Y is a 3D array with some shape (m, p, q), where m is the number of snapshots in the trajectory. You also expect m to scale rather high.
Let's see a basic way to speed up the distance calculation, on the setting m=1000:
import numpy as np
# dummy inputs
m = 1000
p, q = 4, 5
Y = np.random.randn(m, p, q)
# your current method
def foo():
    return [np.sqrt(((((Y[i]- Y[j])**2))*3).mean()) for i in range(m) for j in range(i + 1, m)]
# vectorized approach -> compute the upper triangle of the pairwise distance matrix
def bar():
    u, v = np.triu_indices(Y.shape[0], 1)
    return np.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))
# Check for correctness
out_1 = foo()
out_2 = bar()
print(np.allclose(out_1, out_2))
# True
If we test the time required:
%timeit -n 10 -r 3 foo()
# 3.16 s ± 50.3 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
The first method is really slow; it takes over 3 seconds for this calculation. Let's check the second method:
%timeit -n 10 -r 3 bar()
# 97.5 ms ± 405 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
So we have a ~30x speedup here, which would make your large calculation much more feasible in Python than with the original code. Feel free to test with other sizes of Y to see how it scales compared to the original.
JIT
In addition, you can also try out JIT, mainly jax or numba. It is fairly simple to port the function bar with jax.numpy, for example:
import jax
import jax.numpy as jnp
@jax.jit
def jit_bar(Y):
    u, v = jnp.triu_indices(Y.shape[0], 1)
    return jnp.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))
# check for correctness
print(np.allclose(bar(), jit_bar(Y)))
# True
If we test the time of the jitted jnp op:
%timeit -n 10 -r 3 jit_bar(Y)
# 10.6 ms ± 678 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
So compared to the original, we can reach up to a ~300x speedup.
Note that not every operation can be converted to jax/jit so easily (this particular problem is conveniently suitable), so the general advice is to simply avoid python loops and use numpy's broadcasting/vectorization capabilities, like in bar().
Stick to Julia.
If you've already written it in a language that runs faster, why are you trying to use Python in the first place?
Your question is about speeding up Python, relative to Julia, so I'd like to offer some Julia code for comparison.
Since your data is most naturally expressed as a list of 4x5 arrays, I suggest expressing it as a vector of SMatrixes:
sumdiff2(A, B) = sum((A[i] - B[i])^2 for i in eachindex(A, B))

function dists(Y)
    M = length(Y)
    V = Vector{float(eltype(eltype(Y)))}(undef, sum(1:M-1))
    Threads.@threads for i in eachindex(Y)
        ii = sum(M-i+1:M-1)  # don't worry about this sum
        for j in i+1:lastindex(Y)
            ind = ii + (j-i)
            V[ind] = sqrt(3 * sumdiff2(Y[i], Y[j]) / length(Y[i]))
        end
    end
    return V
end

using Random: randn
using StaticArrays: SMatrix

Ys = [randn(SMatrix{4,5,Float64}) for _ in 1:1000];
Benchmarks:
# single-threaded
julia> using BenchmarkTools
julia> @btime dists($Ys);
  6.561 ms (2 allocations: 3.81 MiB)
# multi-threaded with 6 cores
julia> @btime dists($Ys);
  1.606 ms (75 allocations: 3.82 MiB)
I was not able to install jax on my computer, but when comparing with @Mercury's numpy code I got
foo: 5.5 seconds
bar: 179 ms
i.e. approximately a 3400x speedup over foo.
It is possible to write this as a one-liner at a ~2-3x performance cost.
While Python tends to be slower than Julia for many tasks, it is possible to write numerical code in Python that is as fast as Julia, using Numba and plain loops. Indeed, Numba is based on llvmlite, which is essentially a JIT compiler built on the LLVM toolchain. The standard implementation of Julia also uses a JIT and the LLVM toolchain. This means the two should behave very similarly, apart from language overheads, which are negligible once the computation is performed in parallel (because the resulting computation will be memory-bound on nearly all modern platforms).
This computation can be parallelized in both Julia and Python (still using Numba). While writing a sequential version is quite straightforward, writing a parallel one is a bit more complex. Indeed, computing only the upper triangular values can result in an imbalanced workload and thus a sub-optimal execution time. An efficient strategy is to compute, in each iteration, a pair of lines: one from the top of the upper triangular part and one from the bottom. The top line contains m-i items while the bottom one contains i+1 items, so each iteration computes m+1 items in total and the work per iteration is independent of the iteration number. This results in much better load balancing. The middle line needs to be computed separately, depending on the parity of the input size. A tiny sketch of this pairing schedule follows.
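A tiny pure-Python sketch of that pairing schedule (0-based row indices here; the point is that the work per iteration stays constant):
m = 7
for i in range(m // 2):
    top, bottom = i, m - 1 - i
    # row `top` has m-1-top pairs and row `bottom` has m-1-bottom pairs;
    # together that is always a constant number of pairs, independent of i
    assert (m - 1 - top) + (m - 1 - bottom) == m - 1
if m % 2 == 1:
    middle = m // 2  # the leftover middle row is handled separately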
Here is the final implementation:
import numba as nb
import numpy as np

@nb.njit(inline='always', fastmath=True)
def compute_line(tmp, res, i, m, n):
    offset = (i * (2 * m - i - 1)) // 2
    factor = 3.0 / n
    for j in range(i + 1, m):
        s = 0.0
        for k in range(n):
            s += (tmp[i, k] - tmp[j, k]) ** 2
        res[offset] = np.sqrt(s * factor)
        offset += 1
    return res

@nb.njit(parallel=True, fastmath=True)
def fastest(Y):
    m, n = Y.shape[0], Y.shape[1] * Y.shape[2]
    res = np.empty(m*(m-1)//2)
    tmp = Y.reshape(m, n)
    for i in nb.prange(m//2):
        compute_line(tmp, res, i, m, n)
        compute_line(tmp, res, m-i-1, m, n)
    if m % 2 == 1:
        compute_line(tmp, res, m//2, m, n)
    return res
# [...] same as others
%timeit -n 100 fastest(Y)
Results
Here are performance results on my machine (with a i5-9600KF having 6 cores):
foo (seq, Python, Mercury): 4910.7 ms
bar (seq, Python, Mercury): 134.2 ms
jit_bar (seq, Python, Mercury): ???
dists (seq, Julia, DNF) 6.9 ms
dists (par, Julia, DNF) 2.2 ms
fastest (par, Python, me): 1.5 ms <-----
(Jax does not work on my machine so I cannot test it yet)
This implementation is the fastest one and succeeds in beating the best Julia code so far.
Optimal implementation
Note that for large arrays like (200_000, 4, 5), all implementations provided so far are inefficient since they are not cache friendly. Indeed, the input array will take 32 MiB and will not fit in the cache of most modern processors (and even if it could, one needs to consider the space needed for the output and the fact that caches are not perfect). This can be fixed using tiling, at the expense of an even more complex code. I think such an implementation should be optimal if you use Z-order curves.

Python: Very slow execution loops

I am writing a code for proposing typo corrections using an HMM and the Viterbi algorithm. At some point, for each word in the text, I have to do the following. (Let's assume I have 10,000 words.)
# FYI: Windows 10, 64-bit, Intel i7, 4GB RAM, Python 2.7.3
import numpy as np
import pandas as pd
for k in range(10000):
    tempWord = corruptList20[k]  # temp word read from the list which has all of the words
    delta = np.zeros((26, len(tempWord)))
    sai = np.chararray((26, len(tempWord)))
    sai[:] = '#'

    # INITIALIZATION OF DELTA
    for i in range(26):
        delta[i][0] = # CALCULATION: matrix read and multiplication; each cell is different
    # INITIALIZATION END

    # 6. DELTA CALCULATION
    for deltaIndex in range(1, len(tempWord)):
        for j in range(26):
            tempDelta = 0.0
            maxDelta = 0.0
            maxState = ''
            for i in range(26):
                # CALCULATION to fill each cell involves:
                # 1. matrix read and multiplication
                # 2. finding the column max
                # logical operations and if-then-else operations

    # 7. SAI BACKWARD TRACKING
    delta2 = pd.DataFrame(delta)
    sai2 = pd.DataFrame(sai)
    proposedWord = np.zeros(len(tempWord), str)
    editId = 0
    for col in delta2.columns:
        # CALCULATION to fill each cell involves:
        # 1. matrix read and multiplication
        # 2. finding the column max
        # logical operations and if-then-else operations

    editList20.append(''.join(editWord))
# END OF LOOP
As you can see, it is computationally involved, and when I run it, it takes too much time.
My laptop was recently stolen, so I am currently running this on Windows 10, 64-bit, 4GB RAM, Python 2.7.3.
My question: can anybody see any point I could use to optimize? Do I have to delete the matrices I created in the loop before the loop goes to the next round in order to free memory, or is this done automatically?
After the comments below, and after using xrange instead of range, performance increased by almost 30%.
I don't think the range discussion makes much difference. With Python 3, where range is a lazy sequence, expanding it into a list before iterating doesn't change the time much.
In [107]: timeit for k in range(10000):x=k+1
1000 loops, best of 3: 1.43 ms per loop
In [108]: timeit for k in list(range(10000)):x=k+1
1000 loops, best of 3: 1.58 ms per loop
With numpy and pandas the real key to speeding up loops is to replace them with compiled operations that work on the whole array or dataframe. But even in pure Python, focus on streamlining the contents of the iteration, not the iteration mechanism.
======================
for i in range(26):
    delta[i][0] = # CALCULATION: matrix read and multiplication
A minor change: delta[i, 0] = ...; this is the array way of addressing a single element. Functionally it is often the same, but the intent is clearer. But think: can't you set all of that column at once?
delta[:, 0] = ...
====================
N = len(tempWord)
delta = np.zeros((26, N))
etc.
In tight loops, temporary variables like this can save time. This loop isn't tight, so here it just adds clarity.
===========================
This is one ugly nested triple loop; admittedly 26 steps isn't large, but 26*26*N is:
for deltaIndex in range(1, N):
    for j in range(26):
        tempDelta = 0.0
        maxDelta = 0.0
        maxState = ''
        for i in range(26):
            # CALCULATION
            # 1. matrix read and multiplication
            # 2. finding the column max
            # logical operations and if-then-else operations
Focus on replacing this with array operations. It's those 3 commented lines that need to be changed, not the iteration mechanism.
================
Making proposedWord a list rather than an array might be faster. Small list operations are often faster than array ones, since numpy arrays have a creation overhead.
In [136]: timeit np.zeros(20,str)
100000 loops, best of 3: 2.36 µs per loop
In [137]: timeit x=[' ']*20
1000000 loops, best of 3: 614 ns per loop
You have to be careful when creating 'empty' lists that the elements are truly independent, not just copies of the same thing.
In [159]: %%timeit
   .....: x = np.zeros(20, str)
   .....: for i in range(20):
   .....:     x[i] = chr(65+i)
   .....:
100000 loops, best of 3: 14.1 µs per loop
In [160]: timeit [chr(65+i) for i in range(20)]
100000 loops, best of 3: 7.7 µs per loop
As noted in the comments, the behavior of range changed between Python 2 and 3.
In 2, range constructs an entire list populated with the numbers to iterate over, then iterates over the list. Doing this in a tight loop is very expensive.
In 3, range instead constructs a simple object that (as far as I know) consists of only 3 numbers: the starting number, the step (distance between numbers), and the end number. Using simple math, you can calculate any point along the range without needing to iterate. This makes "random access" O(1) instead of the O(n) cost of iterating an entire list, and prevents the creation of a costly list in the first place.
In 2, use xrange to iterate over a range object instead of a list.
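To illustrate, the arithmetic nature of Python 3's range means even an astronomically large range supports instant indexing and membership tests:
r = range(0, 10**18, 3)
print(r[10**12])             # 3000000000000, computed arithmetically, no iteration
print(2999999999997 in r)    # True: constant-time membership test for integers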
(@Tom: I'll delete this if you post an answer.)
It's hard to see exactly what you need to do because of the missing code, but it's clear that you need to learn how to vectorize your numpy code. This can lead to a 100x speedup.
You can probably get rid of all the inner for-loops and replace them with vectorized operations.
E.g. instead of
for i in range(26):
    delta[i][0] = # CALCULATION: matrix read and multiplication; each cell is different
do
delta[:, 0] = # vectorized form of whatever operation you were going to do
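As an illustration of what full vectorization could look like for this Viterbi-style recurrence, here is a sketch; the trans and emit matrices and the probability model are made up, since the actual CALCULATION is elided in the question:
import numpy as np

# Hypothetical 26-state HMM tables; stand-ins for the question's elided CALCULATION.
trans = np.random.rand(26, 26)   # trans[p, s]: score of moving from state p to s
emit = np.random.rand(26, 12)    # emit[s, t]: score of state s at position t
N = emit.shape[1]

delta = np.zeros((26, N))
sai = np.zeros((26, N), dtype=int)
delta[:, 0] = emit[:, 0]                       # whole first column set at once

for t in range(1, N):                          # one loop instead of three
    scores = delta[:, t-1, None] * trans       # (26, 26): every (prev, next) pair
    sai[:, t] = scores.argmax(axis=0)          # best predecessor per state
    delta[:, t] = scores.max(axis=0) * emit[:, t]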

Possible to use numba.guvectorize to emulate parallel forall / prange?

As a user of Python for data analysis and numerical calculations, rather than a real "coder", I had been missing a really low-overhead way of distributing embarrassingly parallel loop calculations on several cores.
As I learned, there used to be the prange construct in Numba, but it was abandoned because of "instability and performance issues".
Playing with the newly open-sourced @guvectorize decorator, I found a way to use it for a virtually no-overhead emulation of the functionality of the late prange.
I am very happy to have this tool at hand now, thanks to the guys at Continuum Analytics, and I did not find anything on the web explicitly mentioning this use of @guvectorize. Although it may be trivial to people who have been using NumbaPro earlier, I'm posting this for all those fellow non-coders out there (see my answer to this "question").
Consider the example below, where a two-level nested for loop, whose core does some numerical calculation involving two input arrays and a function of the loop indices, is executed in four different ways. Each variant is timed with IPython's %timeit magic:
naive for loop, compiled using numba.jit
forall-like construct using numba.guvectorize, executed in a single thread (target = "cpu")
forall-like construct using numba.guvectorize, executed in as many threads as there are cpu "cores" (in my case hyperthreads) (target = "parallel")
same as 3., however calling the "guvectorized" forall with the sequence of "parallel" loop indices randomly permuted
The last one is done because (in this particular example) the inner loop's range depends on the value of the outer loop's index. I don't know exactly how the dispatch of gufunc calls is organized inside numpy, but it appears as if the randomization of the "parallel" loop indices achieves slightly better load balancing.
On my (slow) machine (1st gen core i5, 2 cores, 4 hyperthreads) I get the timings:
1 loop, best of 3: 8.19 s per loop
1 loop, best of 3: 8.27 s per loop
1 loop, best of 3: 4.6 s per loop
1 loop, best of 3: 3.46 s per loop
Note: I'd be interested if this recipe readily applies to target="gpu" (it should do, but I don't have access to a suitable graphics card right now), and what's the speedup. Please post!
And here's the example:
import numpy as np
from numba import jit, guvectorize, float64, int64
@jit
def naive_for_loop(some_input_array, another_input_array, result):
    for i in range(result.shape[0]):
        for k in range(some_input_array.shape[0] - i):
            result[i] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))

@guvectorize([(float64[:], float64[:], int64[:], float64[:])], '(n),(n),()->()', nopython=True, target='parallel')
def forall_loop_body_parallel(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0]  # just a shorthand
    # do some nontrivial calculation involving elements from the input arrays and the loop index
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))

@guvectorize([(float64[:], float64[:], int64[:], float64[:])], '(n),(n),()->()', nopython=True, target='cpu')
def forall_loop_body_cpu(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0]  # just a shorthand
    # do some nontrivial calculation involving elements from the input arrays and the loop index
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))
arg_size = 20000
input_array_1 = np.random.rand(arg_size)
input_array_2 = np.random.rand(arg_size)
result_array = np.zeros_like(input_array_1)
# do single-threaded naive nested for loop
# reset result_array inside %timeit call
%timeit -r 3 result_array[:] = 0.0; naive_for_loop(input_array_1, input_array_2, result_array)
result_1 = result_array.copy()
# do single-threaded forall loop (loop indices in-order)
# reset result_array inside %timeit call
loop_indices = range(arg_size)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_cpu(input_array_1, input_array_2, loop_indices, result_array)
result_2 = result_array.copy()
# do multi-threaded forall loop (loop indices in-order)
# reset result_array inside %timeit call
loop_indices = range(arg_size)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_parallel(input_array_1, input_array_2, loop_indices, result_array)
result_3 = result_array.copy()
# do forall loop (loop indices scrambled for better load balancing)
# reset result_array inside %timeit call
loop_indices_scrambled = np.random.permutation(range(arg_size))
loop_indices_unscrambled = np.argsort(loop_indices_scrambled)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_parallel(input_array_1, input_array_2, loop_indices_scrambled, result_array)
result_4 = result_array[loop_indices_unscrambled].copy()
# check validity
print(np.all(result_1 == result_2))
print(np.all(result_1 == result_3))
print(np.all(result_1 == result_4))

Python for each faster than for indexed?

In Python, which is faster?
1.
for word in listOfWords:
    doSomethingToWord(word)
2.
for i in range(len(listOfWords)):
    doSomethingToWord(listOfWords[i])
Of course I'd use xrange in Python 2.x.
My assumption is that 1 is faster than 2. If so, why is it?
Use Python's timeit module to answer this kind of question:
duncan@ubuntu:~$ python -m timeit -s "listOfWords=['hello']*1000" "for word in listOfWords: len(word)"
10000 loops, best of 3: 37.2 usec per loop
duncan@ubuntu:~$ python -m timeit -s "listOfWords=['hello']*1000" "for i in range(len(listOfWords)): len(listOfWords[i])"
10000 loops, best of 3: 52.1 usec per loop
Instead of asking questions like this, you can always try things yourself. It is not hard.
Super simple benchmarking will show you the difference:
from datetime import datetime

arr = [4 for _ in xrange(10**8)]

startTime = datetime.now()
for i in arr:
    i
print datetime.now() - startTime

startTime = datetime.now()
for i in xrange(len(arr)):
    arr[i]
print datetime.now() - startTime
On my machine it is:
0:00:04.822513
0:00:05.676396
Note that the list you are iterating over should be pretty big to see the difference. The second loop takes longer because each time it has to do a lookup by index (arr[i]) and also generate the values of the xrange.
Please do not spend too much time on mostly useless micro-optimization; rather, look at whether you can improve the computational complexity of your inner loop functions.
Simply try timeit:
In [2]: def solve(listOfWords):
   ...:     for word in range(len(listOfWords)):
   ...:         pass
   ...:
In [3]: %timeit solve(xrange(10**5))
100 loops, best of 3: 4.34 ms per loop
In [4]: def solve(listOfWords):
   ...:     for word in listOfWords:
   ...:         pass
   ...:
In [5]: %timeit solve(xrange(10**5))
1000 loops, best of 3: 1.84 ms per loop
In addition to the speed advantage, 1 is "cleaner-looking", and it will also work for sequences that do not support len, namely generator expressions and the results of generator functions. To use solution 2, you would first have to convert the generator to a list in order to get its length, if you even could. But what if the generator is generating the list of all prime numbers, and doSomething is looking for the first value > 100?
for num in prime_number_generator():
    if num > 100: return num
There is no way to convert this to the second form, since this generator has no end.
Also, what if it is very expensive to create the elements of the list (as in fetching from a database, or remote web server)? If you are looking for a matching value out of a generated set of N values, with #1 you could exit as soon as you found a match, and avoid on average the generation of N/2 values. To use #2, you first have to generate all N values in order to get the length in order to make the range.
There is a reason Python 3 converted many builtins to return iterators instead of lists - they are more flexible.
What is Pythonic?
"for i in range(len(seq)):"? No.
Use "for x in seq:"

What is the preferred way to take the modulus by a power of 2 of an int in Python

Imagine that you have some counter or other data element that needs to be stored in a field of a binary protocol. The field naturally has some fixed number n of bits and the protocol specifies that you should store the n least significant bits of the counter, so that it wraps around when it is too large. One possible way to implement that is actually taking the modulus by a power of two:
field_value = counter % 2 ** n
But this certainly isn't the most efficient way and maybe not even the easiest to understand, taking into account that the specification is talking about the least significant bits and does not mention a modulus operation. Thus, investigating alternatives is appropriate. Some examples are:
field_value = counter % (1 << n)
field_value = counter & (1 << n) - 1
field_value = counter & ~(-1 << n)
What is the way preferred by experienced Python programmers to implement such a requirement trying to maximize code clarity without sacrificing too much performance?
There is of course no right or wrong answer to this question, so I would like to use this question to collect all the reasonable implementations of this seemingly trivial requirement. An answer should list the alternatives and shortly describe in what circumstance what alternative would preferably be used.
Bit shifting and bitwise operations are more readable in your case, because they simply tell the reader that you are doing bitwise operations here. If you use a numeric operation, the reader may not understand what taking the modulus by that number means.
Talking about performance: you actually don't have to worry too much about this in Python, because operating on a Python object is itself expensive enough that it doesn't matter whether you use numeric or bitwise operations. Here I explain it in a visual way:
<-------------- Python object operation cost --------------><- bit op ->
<-------------- Python object operation cost --------------><----- num op ----->
This is just a rough idea of what it costs to perform the simplest bit or number operation. As you can see, the Python object operation cost takes the majority, so it doesn't matter whether you use bitwise or numeric operations; the difference is too small to matter.
If you really need performance because you have to process a massive amount of data, you should consider one of the following (a numpy sketch follows this list):
Writing the logic in a C/C++ module for Python; you can use a library like Boost.Python
Using a third-party library for mass number processing, such as numpy
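As a sketch of the second suggestion, numpy applies a mask to a whole array of counters in one C-level pass (the counter values and the 16-bit field width are made up for illustration):
import numpy as np
counters = np.arange(10**6, dtype=np.uint64)
field_values = counters & np.uint64((1 << 16) - 1)  # low 16 bits of every counter at once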
You should simply throw away the top bits:
# field_value = counter & (1 << n) - 1
field_value = counter & ALLOWED_BIT_WIDTH
If this were implemented on an embedded device, the registers used could be the limiting factor. In my experience, this is the way it is normally done.
The "limitation" in the protocol is a way of constraining the overhead bandwidth needed by the protocol.
It will probably depend on the Python implementation, but in CPython 2.6 it looks like this:
In [1]: counter = 0xfedcba9876543210
In [10]: %timeit counter % 2**15
1000000 loops, best of 3: 304 ns per loop
In [11]: %timeit counter % (1<<15)
1000000 loops, best of 3: 302 ns per loop
In [12]: %timeit counter & ((1<<15)-1)
10000000 loops, best of 3: 104 ns per loop
In [13]: %timeit counter & ~(1<<15)
10000000 loops, best of 3: 170 ns per loop
In this case, counter & ((1<<15)-1) is the clear winner. Interestingly, 2**15 and 1<<15 take the same amount of time (more or less); I am guessing Python internally optimizes this case and turns 2**15 into 1<<15 anyway.
I once wrote a class that lets you just do this:
bc = BitSliceLong(counter)
bc = bc[15:0]
derived from long, but it's a more general implementation (it lets you take any range of the bits, not just x:0), and the extra overhead for that makes it slower by an order of magnitude, even though it uses the same method inside.
Edit: BTW, precalculating the values doesn't appear to provide any benefit - the dominant factor here is not the actual math operation. If we do
cx_mask = 2**15
counter % cx_mask
the time is the same as when it had to calculate 2**15. This was also true for our 'best case' - precalculating ((1<<15)-1) has no benefit.
Also, in the previous case, I used a large number that is implemented as a long in python. This is not really a native type - it supports arbitrary length numbers, and so needs to handle nearly anything, so implementing operations is not just a single ALU call - it involves a series of bit-shifting and arithmetic operations.
If you can keep the counter below sys.maxint, you'll be using int types instead, and they both appear to be faster & also more dominated by actual math code:
In [55]: %timeit x % (1<<15)
10000000 loops, best of 3: 53.6 ns per loop
In [56]: %timeit x & ((1<<15)-1)
10000000 loops, best of 3: 49.2 ns per loop
In [57]: %timeit x % (2**15)
10000000 loops, best of 3: 53.9 ns per loop
These are all about the same, so it doesn't really matter which one you use here (mod is slightly slower, but within random variation). It makes sense for div/mod to be expensive on very large numbers, requiring a more complex algorithm, while for 'small' ints it can be done in hardware.
