Doing for-loop computations faster in Python

I have a for loop doing some operation on the elements of an array. There are 1e5 elements in the array.
import numpy as np
A = np.array([1,2,3,4..........100000])
for i in range(0, len(A)):
    A[i] = (A[i]*2 + A[i]*4)**(1/3)
I want to parallelise the above code so that each iteration of the for loop runs on a different core and the code executes faster. I have a workstation with 48 cores. How can I achieve this parallel processing in Python?

Don't bother parallelizing just yet. Right now you're taking no advantage of NumPy vectorization; you might as well be using a Python list (or maybe array.array) for all the benefit NumPy is giving you.
Actually use the vectorization features, and the runtime should drop by a couple of orders of magnitude:
import numpy as np
A = np.array([1,2,3,4..........100000])  # If these are actually the values you want, use np.arange(1, 100000+1) to speed it up
A = (A * 6) ** (1 / 3)
# If the result should truncate back to int64, not convert to doubles, cast back at the end
A = A.astype(np.int64)
(A * 6) ** (1 / 3) does the same work as the for loop did, but much faster (you could match the original code more closely with A = (A * 2 + A * 4) ** (1/3), but multiplying by 2 and 4 separately and adding them together is pointless when you could just multiply by 6 directly). The final (optional, depending on intent) line gets exact equivalent behavior of the original loop by truncating back to the original integer dtype.
Comparing performance with ipython %%timeit magic for a microbenchmark:
In [2]: %%timeit
   ...: A = np.arange(1, 100000+1)
   ...: for i in range(len(A)):
   ...:     A[i] = (A[i]*2 + A[i]*4) ** (1/3)
   ...:
427 ms ± 6.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %%timeit
   ...: A = np.arange(1, 100000+1)
   ...: A = (A * 6) ** (1/3)
   ...:
2.72 ms ± 51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The vectorized code takes about 0.6% of the time taken by the naive loop; merely parallelizing the naive loop would never come close to achieving that sort of speedup. Adding the .astype(np.int64) cast only increases runtime by about 6%, still a trivial fraction of what the original for loop required.
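For completeness, since the question specifically asks about using the 48 cores: below is a rough sketch (my own illustration, not part of the original answer) of chunk-level parallelization with the standard library's concurrent.futures; transform_chunk is a hypothetical helper name. In practice the process start-up and data transfer usually cost more than the vectorized one-liner above saves on an array this small, so measure before committing to it.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def transform_chunk(chunk):
    # Same arithmetic as the vectorized answer, applied to one chunk at a time
    return (chunk * 6) ** (1 / 3)

if __name__ == '__main__':
    A = np.arange(1, 100000 + 1)
    chunks = np.array_split(A, 48)                 # roughly one chunk per core
    with ProcessPoolExecutor(max_workers=48) as pool:
        A = np.concatenate(list(pool.map(transform_chunk, chunks)))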

Let numpy do the hard work.
A = (A*2+A*4)**(1/3)

Related

Any chance of making this faster? (numpy.einsum)

I'm trying to multiply three arrays (A x B x A) with shapes (19000, 3) x (19000, 3, 3) x (19000, 3) so that at the end I get a 1-D array of size (19000); i.e. I want to contract only over the last one or two dimensions.
I've got it working with np.einsum() but I'm wondering if there is any way of making this faster, as this is the bottleneck of my whole code.
np.einsum('...i,...ij,...j', A, B, A)
I've already tried it with two separate np.einsum() calls, but that gave me the same performance:
np.einsum('...i, ...i', np.einsum('...i,...ij', A, B), A)
I've also already tried the @ operator and adding some additional axes, but that didn't make it faster either:
(A[:, None] @ B @ A[..., None]).squeeze()
I've tried to get it working with np.inner(), np.dot(), np.tensordot() and np.vdot(), but these never gave me the same results, so I couldn't compare them.
Any other ideas? Is there any way I could get a better performance?
I've already had a quick look at Numba, but as Numba doesn't support np.einsum() and many other NumPy functions, I would have to rewrite a lot of code.
You could use Numba
To begin with, it is always a good idea to look at what np.einsum does. With optimize='optimal' it is usually very good at finding a contraction order with fewer FLOPs. In this case there is actually only a minor optimization possible, and the intermediate array is relatively large (I will stick to the naive version). It should also be mentioned that contractions with very small (fixed?) dimensions are quite a special case. This is also a reason why it is quite easy to outperform np.einsum here (unrolling etc., which a compiler does if it knows that a loop consists of only 3 iterations).
import numpy as np
A = np.random.rand(19000, 3)
B = np.random.rand(19000, 3, 3)
print(np.einsum_path('...i,...ij,...j', A, B, A, optimize="optimal")[1])
"""
Complete contraction: si,sij,sj->s
Naive scaling: 3
Optimized scaling: 3
Naive FLOP count: 5.130e+05
Optimized FLOP count: 4.560e+05
Theoretical speedup: 1.125
Largest intermediate: 5.700e+04 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
3 sij,si->js sj,js->s
2 js,sj->s s->s
"""
Numba implementation
import numba as nb

# si,sij,sj->s
@nb.njit(fastmath=True, parallel=True, cache=True)
def nb_einsum(A, B):
    # Check the inputs at the beginning.
    # I assume that the asserted shapes are always constant;
    # this makes it easier for the compiler to optimize.
    assert A.shape[1] == 3
    assert B.shape[1] == 3
    assert B.shape[2] == 3
    # allocate output
    res = np.empty(A.shape[0], dtype=A.dtype)
    for s in nb.prange(A.shape[0]):
        # Using a syntax like this is also important for performance
        acc = 0
        for i in range(3):
            for j in range(3):
                acc += A[s, i] * B[s, i, j] * A[s, j]
        res[s] = acc
    return res
Timings
# warmup: the first call is always slower
# (due to compilation or loading the cached function)
res=nb_einsum(A,B)
%timeit nb_einsum(A,B)
#43.2 µs ± 1.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.einsum('...i,...ij,...j', A, B, A,optimize=True)
#450 µs ± 8.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.einsum('...i,...ij,...j', A, B, A)
#977 µs ± 4.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.allclose(np.einsum('...i,...ij,...j', A, B, A,optimize=True),nb_einsum(A,B))
#True

Why doesn't numpy.zeros allocate all of its memory on creation? And how can I force it to?

I want to create an empty Numpy array in Python, to later fill it with values. The code below generates a 1024x1024x1024 array with 2-byte integers, which means it should take at least 2GB in RAM.
>>> import numpy as np; from sys import getsizeof
>>> A = np.zeros((1024,1024,1024), dtype=np.int16)
>>> getsizeof(A)
2147483776
From getsizeof(A), we see that the array takes 2^31 + 128 bytes (presumably of header information.) However, using my task manager, I can see Python is only taking 18.7 MiB of memory.
Suspecting the array might be compressed, I assigned random values to each memory slot so that it could not be.
>>> for i in range(1024):
...     for j in range(1024):
...         for k in range(1024):
...             A[i,j,k] = np.random.randint(32767, dtype=np.int16)
The loop is still running, and my RAM is slowly increasing (presumably as the arrays composing A inflate with the incompressible noise). I'm assuming it would make my code faster to force numpy to allocate this array in full from the beginning. Curiously, I haven't seen this documented anywhere!
So, 1. Why does numpy do this? and 2. How can I force numpy to allocate memory?
A neat answer to your first question can also be found in this StackOverflow answer.
To answer your second question, you can force the memory to be allocated as follows in a more or less efficient manner:
A = np.empty((1024,1024,1024), dtype=np.int16)
A.fill(0)
because then the memory is touched.
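If you want to watch the pages actually being committed, here is a rough sketch (assuming the third-party psutil package is installed and roughly 2 GB of free RAM; the exact numbers depend on your operating system's lazy-allocation behaviour):
import numpy as np
import psutil

proc = psutil.Process()
print("before:     ", proc.memory_info().rss // 2**20, "MiB")

A = np.zeros((1024, 1024, 1024), dtype=np.int16)   # pages reserved, typically not yet touched
print("after zeros:", proc.memory_info().rss // 2**20, "MiB")

A.fill(0)                                          # touches every page
print("after fill: ", proc.memory_info().rss // 2**20, "MiB")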
On my machine, with my setup,
A = np.empty(0)
A.resize((1024, 1024, 1024))
also does the trick, but I cannot find this behavior documented, and this might be an implementation detail; realloc is used under the hood in numpy.
Let's look at some timings for a smaller case:
In [107]: A = np.zeros(10000,int)
In [108]: for i in range(A.shape[0]): A[i]=np.random.randint(327676)
We don't need to make A 3d to get the same effect; 1d of the same total size would be just as good.
In [109]: timeit for i in range(A.shape[0]): A[i]=np.random.randint(327676)
37 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now compare that time to the alternative of generating the random numbers with one call:
In [110]: timeit np.random.randint(327676, size=A.shape)
185 µs ± 905 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Much much faster.
If we do the same loop, but simply assign the random number to a variable (and throw it away):
In [111]: timeit for i in range(A.shape[0]): x=np.random.randint(327676)
32.3 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The times are nearly the same as the original case. Assigning the values to the zeros array is not the big time consumer.
I'm not testing a very large case as you are, and my A has already been initialized in full. So you are welcome to repeat the comparisons with your size. But I think the pattern will still hold - iterating 1024x1024x1024 times (about 100,000 times more than my example) is the big time consumer, not the memory allocation task.
Something else you might experiment with: just iterate on the first dimension of A, and assign random integers shaped like the other dimension. For example, expanding my A with a size-10 dimension:
In [112]: A = np.zeros((10,10000),int)
In [113]: timeit for i in range(A.shape[0]): A[i]=np.random.randint(327676,size=A.shape[1])
1.95 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A is 10x larger than in [107], but takes roughly 19x less time to fill, because it only has to iterate 10 times. In numpy, if you must iterate, try to do it a few times on a more complex task.
(timeit repeats the test many times (e.g. 7*10), so it isn't going to capture any initial memory allocation step, even if I use a large enough array for that to matter).
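As a side note (my addition, not part of the answer above): if the end goal is simply a large array of random int16 values, a single vectorized call avoids the Python-level iteration entirely, and it also settles the allocation question because the whole array is written as it is created.
import numpy as np

# Same value range as the original loop (0 to 32766), written in one call
A = np.random.randint(32767, size=(1024, 1024, 1024), dtype=np.int16)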

Why is Python broadcasting in the example below slower than a simple loop?

I have an array of vectors and compute the norm of their differences versus the first one.
When using python broadcasting, the calculation is significantly slower than doing it via a simple loop. Why?
import numpy as np
def norm_loop(M, v):
    n = M.shape[0]
    d = np.zeros(n)
    for i in range(n):
        d[i] = np.sum((M[i] - v)**2)
    return d

def norm_bcast(M, v):
    n = M.shape[0]
    d = np.zeros(n)
    d = np.sum((M - v)**2, axis=1)
    return d
M = np.random.random_sample((1000, 10000))
v = M[0]
%timeit norm_loop(M, v)
25.9 ms
%timeit norm_bcast(M, v)
38.5 ms
I have Python 3.6.3 and Numpy 1.14.2
To run the example in google colab:
https://drive.google.com/file/d/1GKzpLGSqz9eScHYFAuT8wJt4UIZ3ZTru/view?usp=sharing
Memory access.
First off, the broadcast version can be simplified to
def norm_bcast(M, v):
    return np.sum((M - v)**2, axis=1)
This still runs slightly slower than the looped version.
Now, conventional wisdom says that vectorized code using broadcasting should always be faster, which in many cases isn't true (I'll shamelessly plug another of my answers here). So what's happening?
As I said, it comes down to memory access.
In the broadcast version, v is subtracted from every row of M. By the time the last row of M is processed, the results of processing the first row have been evicted from cache, so for the second step these differences are loaded into cache memory again and squared. Finally, they are loaded and processed a third time for the summation. Since M is quite large, parts of the cache are cleared on each step to accommodate all of the data.
In the looped version each row is processed completely in one smaller step, leading to fewer cache misses and overall faster code.
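To make the cache argument concrete, here is a hypothetical middle ground (my own sketch, not from the original answer; norm_blocked and its block parameter are made-up names): do the broadcasted computation on blocks of rows small enough to stay in cache, which keeps the vectorized arithmetic but avoids streaming all of M through the cache three times.
import numpy as np

def norm_blocked(M, v, block=64):
    # block is a tuning knob: pick it so that block * M.shape[1] * 8 bytes fits in cache
    n = M.shape[0]
    d = np.empty(n)
    for start in range(0, n, block):
        stop = min(start + block, n)
        d[start:stop] = np.sum((M[start:stop] - v)**2, axis=1)
    return d
Whether this beats the plain loop depends on the row length and cache sizes, so time it on your own data.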
Lastly, it is possible to avoid this with some array operations by using einsum.
This function allows mixing matrix multiplications and summations.
First, I'll point out it's a function that has rather unintuitive syntax compared to the rest of numpy, and potential improvements often aren't worth the extra effort to understand it.
The answer may also be slightly different due to rounding errors.
In this case it can be written as
def norm_einsum(M, v):
    tmp = M - v
    return np.einsum('ij,ij->i', tmp, tmp)
This reduces it to two operations over the entire array - a subtraction, and calling einsum, which performs the squaring and summation.
This gives a slight improvement:
%timeit norm_bcast(M, v)
30.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit norm_loop(M, v)
25.1 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit norm_einsum(M, v)
21.7 ms ± 65.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Squeezing out maximum performance
The vectorized operations clearly have bad cache behaviour. But the calculation itself is also slow because it doesn't exploit modern SIMD instructions (AVX2, FMA). Fortunately it isn't really complicated to overcome these issues.
Example
import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def norm_loop_improved(M, v):
    n = M.shape[0]
    d = np.empty(n, dtype=M.dtype)

    # enables SIMD-vectorization
    # if the arrays are not aligned
    M = np.ascontiguousarray(M)
    v = np.ascontiguousarray(v)

    for i in nb.prange(n):
        dT = 0.
        for j in range(v.shape[0]):
            dT += (M[i,j] - v[j]) * (M[i,j] - v[j])
        d[i] = dT
    return d
Performance
M = np.random.random_sample((1000, 1000))
norm_loop_improved: 0.11 ms**, 0.28 ms
norm_loop:          6.56 ms
norm_einsum:        3.84 ms

M = np.random.random_sample((10000, 10000))
norm_loop_improved: 34 ms
norm_loop:          223 ms
norm_einsum:        379 ms
** Be careful when measuring performance
The first result (0.11 ms) comes from calling the function repeatedly with the same data. That would require 77 GB/s of read throughput from RAM, which is far more than my DDR3 dual-channel RAM is capable of. Because calling a function with the same input parameters over and over isn't realistic at all, we have to modify the measurement.
To avoid this issue, we have to call the same function with different data at least twice (8 MB L3 cache, 8 MB of data) and then divide the result by two, so that the caches are effectively cleared between calls.
The relative performance of these methods also differs with array size (have a look at the einsum results).
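A minimal sketch of that measurement strategy (my own illustration of the procedure described above, not code from the answer): cycle through several independent copies of the data so that successive calls cannot reuse a warm cache, then divide the total time by the number of calls.
import timeit
import numpy as np

datasets = [np.random.random_sample((1000, 1000)) for _ in range(4)]   # ~32 MB in total, larger than an 8 MB L3

def run_all():
    for M in datasets:
        norm_loop_improved(M, M[0])    # the Numba function defined above

n_repeats = 100
total = timeit.timeit(run_all, number=n_repeats)
print("per call:", total / (n_repeats * len(datasets)) * 1e3, "ms")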

How to speed up an algorithm on boolean arrays

I have a very large piece of data, and I want to find specific elements and convert them from bool to number. For example, I want to find whether each element is in the interval (0.3, 0.4), and convert True to 1 and False to 0.
i = np.random.rand(1000, 1000, 1000)
j = ((0.3 < i) * (i < 0.4)) * 1
Does j = ((0.3 < i) & (i < 0.4)) * 1 work the same as the expression above?
I know bool*bool is time-consuming and uses a huge amount of memory, and so does converting bool to number. How can I speed up the algorithm and save memory? Is there a way to evaluate 0.3 < i < 0.4 quickly?
Yes, for boolean arrays & and * are identical because both are only True if both operands are True, otherwise False.
You already found out that each operation creates a temporary array (although newer NumPy versions might be optimized in that respect), so you have one temporary boolean array for each <, one for the * or the & and then you create an integer array with the * 1. Without using additional libraries you can't avoid that. NumPy is fast because it does the loops in C but that means you have to deal with temporary arrays.
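As a small aside (my own sketch, not part of this answer): while the temporaries themselves can't be eliminated in plain NumPy, you can at least reuse one of the boolean buffers via the out= argument and skip the final integer copy by reinterpreting the boolean buffer, at the cost of getting int8 instead of the default integer dtype.
import numpy as np

i = np.random.rand(100, 100, 100)          # smaller than the original, just for illustration
lower = np.less(0.3, i)                    # first boolean temporary
upper = np.less(i, 0.4)                    # second boolean temporary
np.logical_and(lower, upper, out=lower)    # reuse the first buffer for the AND
j = lower.view(np.int8)                    # 0/1 values without another copy (int8, not int64)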
But with additional libraries you actually can speed that up and make it more memory-efficient.
Numba:
import numba as nb
import numpy as np

@nb.njit
def numba_func(arr, lower, upper):
    res = np.zeros(arr.size, dtype=np.int8)
    arr_raveled = arr.ravel()
    for idx in range(arr.size):
        res[idx] = lower < arr_raveled[idx] < upper
    return res.reshape(arr.shape)
>>> numba_func(i, 0.3, 0.4) # sample call
Numexpr
import numexpr as ne
ne.evaluate('((0.3<i)&(i<0.4))*1')
However, numexpr is more of a black box; you don't control how much memory it needs, but in most cases where you deal with multiple element-wise NumPy operations it's very fast and much more memory-efficient than NumPy.
Cython
I'm using IPython magic here. If you don't use IPython or Jupyter you probably need to cythonize it yourself.
%load_ext cython
%%cython
import numpy as np
cimport numpy as cnp
cpdef cnp.int8_t[:] cython_func(double[:] arr, double lower, double upper):
    cdef Py_ssize_t idx
    cdef cnp.int8_t[:] res = np.empty(len(arr), dtype=np.int8)
    for idx in range(len(arr)):
        res[idx] = lower < arr[idx] < upper
    return res
Given that I used 1D-memoryviews here, you need to cast it to an array and reshape it afterwards:
np.asarray(cython_func(i.ravel(), 0.3, 0.4)).reshape(i.shape) # sample call
There are probably better ways to get around the ravel, asarray and reshape but those require that you know the dimension of your array.
Timing
I use a smaller array because I don't have much RAM but you can easily change the numbers:
i = np.random.random((1000, 1000, 10))
%timeit numba_func(i, 0.3, 0.4)
52.1 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ne.evaluate('((0.3<i)&(i<0.4))*1')
77.1 ms ± 6.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.asarray(cython_func(i.ravel(), 0.3, 0.4)).reshape(i.shape)
146 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit ((0.3<i)&(i<0.4))*1
180 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Yes, the expression works the same. You can check it with
jmult = ((0.3 < i) * (i < 0.4)) * 1
jand = ((0.3 < i) & (i < 0.4)) * 1
(jand == jmult).all()

Python's sum vs. NumPy's numpy.sum

What are the differences in performance and behavior between using Python's native sum function and NumPy's numpy.sum? sum works on NumPy's arrays and numpy.sum works on Python lists and they both return the same effective result (haven't tested edge cases such as overflow) but different types.
>>> import numpy as np
>>> np_a = np.array(range(5))
>>> np_a
array([0, 1, 2, 3, 4])
>>> type(np_a)
<class 'numpy.ndarray'>
>>> py_a = list(range(5))
>>> py_a
[0, 1, 2, 3, 4]
>>> type(py_a)
<class 'list'>
# The numerical answer (10) is the same for the following sums:
>>> type(np.sum(np_a))
<class 'numpy.int32'>
>>> type(sum(np_a))
<class 'numpy.int32'>
>>> type(np.sum(py_a))
<class 'numpy.int32'>
>>> type(sum(py_a))
<class 'int'>
Edit: I think my practical question here is whether using numpy.sum on a list of Python integers would be any faster than using Python's own sum.
Additionally, what are the implications (including performance) of using a Python integer versus a scalar numpy.int32? For example, for a += 1, is there a behavior or performance difference if the type of a is a Python integer or a numpy.int32? I am curious if it is faster to use a NumPy scalar datatype such as numpy.int32 for a value that is added or subtracted a lot in Python code.
For clarification, I am working on a bioinformatics simulation which partly consists of collapsing multidimensional numpy.ndarrays into single scalar sums which are then additionally processed. I am using Python 3.2 and NumPy 1.6.
I got curious and timed it. numpy.sum seems much faster for numpy arrays, but much slower on lists.
import numpy as np
import timeit

x = range(1000)
# or
#x = np.random.standard_normal(1000)

def pure_sum():
    return sum(x)

def numpy_sum():
    return np.sum(x)

n = 10000

t1 = timeit.timeit(pure_sum, number = n)
print 'Pure Python Sum:', t1
t2 = timeit.timeit(numpy_sum, number = n)
print 'Numpy Sum:', t2
Result when x = range(1000):
Pure Python Sum: 0.445913167735
Numpy Sum: 8.54926219673
Result when x = np.random.standard_normal(1000):
Pure Python Sum: 12.1442425643
Numpy Sum: 0.303303771848
I am using Python 2.7.2 and Numpy 1.6.1
[...] my [...] question here is would using numpy.sum on a list of Python integers be any faster than using Python's own sum?
The answer to this question is: No.
Python's sum will be faster on lists, while NumPy's sum will be faster on arrays. I actually did a benchmark to show the timings (Python 3.6, NumPy 1.14):
import random
import numpy as np
import matplotlib.pyplot as plt
from simple_benchmark import benchmark
%matplotlib notebook
def numpy_sum(it):
    return np.sum(it)

def python_sum(it):
    return sum(it)

def numpy_sum_method(arr):
    return arr.sum()

b_array = benchmark(
    [numpy_sum, numpy_sum_method, python_sum],
    arguments={2**i: np.random.randint(0, 10, 2**i) for i in range(2, 21)},
    argument_name='array size',
    function_aliases={numpy_sum: 'numpy.sum(<array>)', numpy_sum_method: '<array>.sum()', python_sum: "sum(<array>)"}
)

b_list = benchmark(
    [numpy_sum, python_sum],
    arguments={2**i: [random.randint(0, 10) for _ in range(2**i)] for i in range(2, 21)},
    argument_name='list size',
    function_aliases={numpy_sum: 'numpy.sum(<list>)', python_sum: "sum(<list>)"}
)
With these results:
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
b_array.plot(ax=ax1)
b_list.plot(ax=ax2)
Left: on a NumPy array; Right: on a Python list.
Note that this is a log-log plot because the benchmark covers a very wide range of values. However for qualitative results: Lower means better.
This shows that for lists Python's sum is always faster, while for arrays np.sum or the array's sum method is faster (except for very short arrays, where Python's sum is faster).
Just in case you're interested in comparing these against each other I also made a plot including all of them:
f, ax = plt.subplots(1)
b_array.plot(ax=ax)
b_list.plot(ax=ax)
ax.grid(which='both')
Interestingly, the point at which NumPy on arrays can compete with Python on lists is roughly around 200 elements! Note that this number may depend on a lot of factors, such as Python/NumPy version, ... Don't take it too literally.
What hasn't been mentioned is the reason for this difference (I mean the large-scale difference, not the difference for short lists/arrays where the functions simply have different constant overhead). Assuming CPython, a Python list is a wrapper around a C array of pointers to Python objects (in this case Python integers). These integers can be seen as wrappers around a C integer (not actually correct, because Python integers can be arbitrarily big, so they cannot simply use one C integer, but it's close enough).
For example, a list like [1, 2, 3] is stored (schematically, leaving out a few details) as a C array of pointers, each pointing to a separate Python int object.
A NumPy array, however, is a wrapper around a C array containing C values (in this case int or long, depending on 32 or 64 bit and on the operating system).
So a NumPy array like np.array([1, 2, 3]) is stored as one contiguous C array of machine integers, with no per-element Python objects.
The next thing to understand is how these functions work:
Python's sum iterates over the iterable (in this case the list or array) and adds all elements.
NumPy's sum method iterates over the stored C array, adds these C values, and finally wraps the result in a Python type (in this case numpy.int32 or numpy.int64) and returns it.
NumPy's sum function converts the input to an array (at least if it isn't an array already) and then uses the NumPy sum method.
Clearly, adding C values from a C array is much faster than adding Python objects, which is why the NumPy functions can be much faster (see the second plot above: on arrays the NumPy functions beat Python's sum by far for large arrays).
But converting a Python list to a NumPy array is relatively slow, and then you still have to add the C values; that is why for lists Python's sum will be faster.
The only remaining open question is why Python's sum on an array is so slow (it's the slowest of all compared functions). And that actually has to do with the fact that Python's sum simply iterates over whatever you pass in. In the case of a list it gets the stored Python objects, but in the case of a 1D NumPy array there are no stored Python objects, just C values, so Python and NumPy have to create a Python object (a numpy.int32 or numpy.int64) for each element, and then these Python objects have to be added. Creating the wrapper around each C value is what makes it really slow.
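A quick illustrative timing of that wrapping cost (a sketch; absolute numbers will vary by machine and version): summing the list reuses the Python int objects that already exist, while summing the array forces one scalar wrapper object to be created per element.
import timeit
import numpy as np

arr = np.arange(100000)
lst = arr.tolist()

print("sum(list): ", timeit.timeit(lambda: sum(lst), number=100))
print("sum(array):", timeit.timeit(lambda: sum(arr), number=100))
print("arr.sum(): ", timeit.timeit(lambda: arr.sum(), number=100))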
Additionally, what are the implications (including performance) of using a Python integer versus a scalar numpy.int32? For example, for a += 1, is there a behavior or performance difference if the type of a is a Python integer or a numpy.int32?
I made some tests, and for addition and subtraction of scalars you should definitely stick with Python integers. Even though there could be some caching going on, which means the following tests might not be totally representative:
from itertools import repeat

python_integer = 1000
numpy_integer_32 = np.int32(1000)
numpy_integer_64 = np.int64(1000)

def repeatedly_add_one(val):
    for _ in repeat(None, 100000):
        _ = val + 1
%timeit repeatedly_add_one(python_integer)
3.7 ms ± 71.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_add_one(numpy_integer_32)
14.3 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_add_one(numpy_integer_64)
18.5 ms ± 494 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
def repeatedly_sub_one(val):
    for _ in repeat(None, 100000):
        _ = val - 1
%timeit repeatedly_sub_one(python_integer)
3.75 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_sub_one(numpy_integer_32)
15.7 ms ± 437 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_sub_one(numpy_integer_64)
19 ms ± 834 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It's 3-6 times faster to do scalar operations with Python integers than with NumPy scalars. I haven't checked why that's the case but my guess is that NumPy scalars are rarely used and probably not optimized for performance.
The difference becomes a bit less if you actually perform arithmetic operations where both operands are numpy scalars:
def repeatedly_add_one(val):
    one = type(val)(1)  # create a 1 with the same type as the input
    for _ in repeat(None, 100000):
        _ = val + one
%timeit repeatedly_add_one(python_integer)
3.88 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_add_one(numpy_integer_32)
6.12 ms ± 324 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_add_one(numpy_integer_64)
6.49 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Then it's only 2 times slower.
In case you wondered why I used itertools.repeat here when I could simply have used for _ in range(...) instead: repeat is faster and thus incurs less overhead per loop iteration. Because I'm only interested in the addition/subtraction time, it's preferable not to have the looping overhead messing with the timings (at least not that much).
Note that Python sum on multidimensional numpy arrays will only perform a sum along the first axis:
sum(np.array([[[2,3,4],[4,5,6]],[[7,8,9],[10,11,12]]]))
Out[47]:
array([[ 9, 11, 13],
[14, 16, 18]])
np.sum(np.array([[[2,3,4],[4,5,6]],[[7,8,9],[10,11,12]]]), axis=0)
Out[48]:
array([[ 9, 11, 13],
[14, 16, 18]])
np.sum(np.array([[[2,3,4],[4,5,6]],[[7,8,9],[10,11,12]]]))
Out[49]: 81
Numpy should be much faster, especially when your data is already a numpy array.
Numpy arrays are a thin layer over a standard C array. When numpy sum iterates over this, it isn't doing type checking and it is very fast. The speed should be comparable to doing the operation using standard C.
In comparison, when you use Python's sum on a NumPy array, each element first has to be converted to a Python object before it can be added, and sum then iterates over those objects. It has to do some type checking and is generally going to be slower.
Exactly how much slower Python's sum is than NumPy's sum is not well defined, since Python's sum is still a fairly optimized function compared to writing your own sum function in Python.
This is an extension to the answer posted above by Akavall. From that answer you can see that np.sum performs faster for np.array objects, whereas sum performs faster for list objects. To expand upon that:
Running np.sum on an np.array object vs. sum on a list object, they seem to perform neck and neck.
# I'm running IPython
In [1]: x = range(1000) # list object
In [2]: y = np.array(x) # np.array object
In [3]: %timeit sum(x)
100000 loops, best of 3: 14.1 µs per loop
In [4]: %timeit np.sum(y)
100000 loops, best of 3: 14.3 µs per loop
Above, sum is a tiny bit faster than np.sum, although at times I've seen np.sum timings of 14.1 µs too. But mostly it's 14.3 µs.
If you use sum(), it gives:
a = np.arange(6).reshape(2, 3)
print(a)
print(sum(a))
print(sum(sum(a)))
print(np.sum(a))
>>>
[[0 1 2]
[3 4 5]]
[3 5 7]
15
15
