Converting NumPy array to a set takes too long - python

I'm trying to execute the following
from numpy import *
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]]) # large nparray
for item in x:
    set(item)
and it takes very long compared to:
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]]) # large nparray
for item in x:
    item.tolist()
Why does it take much longer to convert a NumPy array to a set than to a list?
I mean basically both have complexity O(n)?

TL;DR: The set() function creates a set using Python's iteration protocol. But iterating (on the Python level) over NumPy arrays is so slow that using tolist() to convert the array to a Python list before doing the iteration is (much) faster.
To understand why iterating over NumPy arrays is so slow it's important to know how Python objects, Python lists, and NumPy arrays are stored in memory.
A Python object needs some bookkeeping properties (like the reference count, a link to its class, ...) and the value it represents. For example the integer ten = 10 could look like this:
The blue circle is the "name" you use in the Python interpreter for the variable ten and the lower object (instance) is what actually represents the integer (since the bookkeeping properties aren't important here I ignored them in the images).
A Python list is just a collection of Python objects, for example mylist = [1, 2, 3] would be saved like this:
This time the list references the Python integers 1, 2 and 3 and the name mylist just references the list instance.
But an array myarray = np.array([1, 2, 3]) doesn't store Python objects as elements:
The values 1, 2 and 3 are stored directly in the NumPy array instance.
With this information I can explain why iterating over an array is so much slower compared to an iteration over a list:
Each time you access the next element in a list the list just returns a stored object. That's very fast because the element already exists as Python object (it just needs to increment the reference count by one).
On the other hand when you want an element of an array it needs to create a new Python "box" for the value with all the bookkeeping stuff before it is returned. When you iterate over the array it needs to create one Python box for each element in your array:
Creating these boxes is slow and the main reason why iterating over NumPy arrays is much slower than iterating over Python collections (lists/tuples/sets/dictionaries) which store the values and their box:
import numpy as np
arr = np.arange(100000)
lst = list(range(100000))
def iterateover(obj):
    for item in obj:
        pass
%timeit iterateover(arr)
# 20.2 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit iterateover(lst)
# 3.96 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
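The boxing described above can be observed directly. The following is a small sketch (the variable names are only illustrative): indexing a list twice hands back the same stored object, whereas indexing a NumPy array twice builds a fresh scalar object each time.

```python
import numpy as np

lst = [1000, 2000, 3000]
arr = np.array(lst)

# A list stores references, so both lookups return the very same object.
same_list_obj = lst[0] is lst[0]

# An array stores raw values, so each lookup has to create a new "box".
first = arr[0]
second = arr[0]
same_arr_obj = first is second

print(same_list_obj)  # True
print(same_arr_obj)   # False
print(type(first))    # a NumPy integer type; the exact width depends on the platform
```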
The set "constructor" just does an iteration over the object.
One thing I can't answer definitely is why the tolist method is so much faster. In the end each value in the resulting Python list needs to be in a "Python box" so there's not much work that tolist could avoid. But one thing I know for sure is that list(array) is slower than array.tolist():
arr = np.arange(100000)
%timeit list(arr)
# 20 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr.tolist()
# 10.3 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Each of these has O(n) runtime complexity but the constant factors are very different.
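One difference that is easy to verify, and that may account for part of the gap, is the type of the elements each call produces. This is only a sketch, not a full explanation: list(arr) goes through the iteration protocol and yields NumPy scalar objects, while arr.tolist() converts each value to a built-in Python type.

```python
import numpy as np

arr = np.arange(5)

via_list = list(arr)      # iterates over the array, element by element
via_tolist = arr.tolist() # conversion happens inside NumPy

print(type(via_list[0]))    # a NumPy integer type (platform dependent width)
print(type(via_tolist[0]))  # <class 'int'>
print(via_tolist)           # [0, 1, 2, 3, 4]
```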
In your case you compared set() to tolist(), which isn't a particularly good comparison. It would make more sense to compare set(arr) to list(arr) or set(arr.tolist()) to arr.tolist():
arr = np.random.randint(0, 1000, (10000, 3))
def tosets(arr):
    for line in arr:
        set(line)
def tolists(arr):
    for line in arr:
        list(line)
def tolists_method(arr):
    for line in arr:
        line.tolist()
def tosets_intermediatelist(arr):
    for line in arr:
        set(line.tolist())
%timeit tosets(arr)
# 72.2 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists(arr)
# 80.5 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists_method(arr)
# 16.3 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit tosets_intermediatelist(arr)
# 38.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So if you want sets you are better off using set(arr.tolist()). For bigger arrays it could make sense to use np.unique but because your rows only contain 3 items that will likely be slower (for thousands of elements it could be much faster!).
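The two options mentioned here can be sketched as follows. The small arrays are made-up stand-ins for the real data, and the sketch makes no claim about which option is faster for a given size.

```python
import numpy as np

arr = np.array([[3, 2, 3], [711, 4, 104], [4, 4, 782]])

# Convert the whole 2D array once, then build one set per (already boxed) row.
row_sets = [set(row) for row in arr.tolist()]
print(row_sets)

# For one long 1D array, np.unique returns the sorted distinct values as an array.
long_row = np.array([5, 1, 5, 3, 1])
distinct = np.unique(long_row)
print(distinct)  # [1 3 5]
```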
In the comments you asked about numba and yes, it's true that numba could speed this up. Numba supports typed sets (only numeric types), but that doesn't mean it will be always faster.
I'm not sure how numba (re-)implements sets but because they are typed it's likely they also avoid the "Python boxes" and store the values directly inside the set:
Sets are more complicated than lists because they involve hashes and empty slots (Python uses open-addressing for sets, so I assume numba will too).
Like the NumPy array the numba set saves the values directly. So when you convert a NumPy array to a numba set (or vice versa) it won't need to use "Python boxes" at all, so when you create the sets in a numba nopython function it will be much faster even than the set(arr.tolist()) operation:
import numba as nb
@nb.njit
def tosets_numba(arr):
    for lineno in range(arr.shape[0]):
        set(arr[lineno])
tosets_numba(arr) # warmup
%timeit tosets_numba(arr)
# 6.55 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That's roughly five times faster than the set(arr.tolist()) approach. But it's important to highlight that I did not return the sets from the function. When you return a set from a nopython numba function to Python, Numba creates a Python set, including "creating the boxes" for all values in the set (that's something numba is hiding).
Just FYI: The same boxing/unboxing happens if you pass lists to Numba nopython functions or return lists from these functions. So what's an O(1) operation in Python is an O(n) operation with Numba! That's why it's generally better to pass NumPy arrays to numba nopython functions (which is O(1)).
I assume that if you return these sets from the function (not really possible right now because numba doesn't support lists of sets currently) it would be slower (because it creates a numba set and later converts it to a python set) or only marginally faster (if the conversion numbaset -> pythonset is really, really fast).
Personally I would use numba for sets only if I don't need to return them from the function and do all operations on the set inside the function and only if all the operations on the set are supported in nopython mode. In any other case I wouldn't use numba here.
Just a note: from numpy import * should be avoided, you hide several python built-in functions when you do that (sum, min, max, ...) and it puts a lot of stuff into your globals. Better to use import numpy as np. The np. in front of function calls makes the code clearer and isn't much to type.
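A short illustration of why the shadowing matters: after from numpy import *, the name sum refers to numpy.sum, whose second positional argument means something different. The sketch below calls both functions explicitly so it does not depend on a star import.

```python
import builtins
import numpy as np

# Built-in sum: the second positional argument is the start value.
builtin_result = builtins.sum(range(5), -1)  # (0+1+2+3+4) + (-1)

# numpy.sum: the second positional argument is the axis.
numpy_result = np.sum(range(5), -1)          # sum along the last axis

print(builtin_result)     # 9
print(int(numpy_result))  # 10
```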

Here is a way to speed things up: avoid the loop and use a multiprocessing pool.map trick
from multiprocessing.dummy import Pool as ThreadPool
import multiprocessing
pool = ThreadPool(multiprocessing.cpu_count()) # get the number of CPU
y = pool.map(set,x) # apply the function to your iterable
pool.close()
pool.join()

Related

Numba in nopython mode is much slower than pure python (no print statements or specified numpy functions)

I have recently discovered that Numba may work much slower than pure Python even in nopython mode with the parallel=True option enabled.
Important: If you don't deal with Voronoi diagrams please continue reading, my question doesn't relate to them directly.
Currently, I am working on a problem where I have energy associated with the Voronoi diagram's edges and cell areas. The scipy Voronoi returns an array containing pairs of points (vor.ridge_points) associated with each Voronoi edge. For my code, I want to be able to get the index of an edge from the indexes of its associated points, so I define a kind of adjacency matrix, but instead of ones and zeros, it holds zeros and edge indexes.
It turns out that pure Python looping over NumPy arrays is 10 times faster than Numba. Here is a toy example (I just randomly generated arrays with the same number of edges and points as in my simulation).
My guess is that it has something to do with memory allocation. Any take on the subject would be appreciated (the reason why it is so much slower, or a better way to get the edge number from the point numbers) :)
# %%
from numba.np.ufunc import parallel
import numpy as np
from numba import njit
from numba import prange
# %% generating an array that models the array of ridges
points_number = 8802
ridges_number = 26379
np.random.seed(123)
ridge_points = np.random.randint(points_number, size=(ridges_number, 2))
# %% symmetric matrix containing indexes of all edges
# in space [original_point_1, original_point_2]
ridge_points = np.array(ridge_points, dtype=np.int32)
@njit(parallel=True, cache=True)
def jit_edges_matrix_op(r_p, r_n):
    matrix = np.zeros((r_n, r_n), dtype=np.int32)
    for i in prange(r_n):
        e1 = r_p[i, 0]
        e2 = r_p[i, 1]
        matrix[e1, e2] = i
        matrix[e2, e1] = i
    return matrix
e_matrix_op = jit_edges_matrix_op(ridge_points, ridges_number)
# %% the same but not jitted
def edges_matrix_op(r_p, r_n):
    matrix = np.zeros((r_n, r_n), dtype=np.int32)
    for i in range(r_n):
        e1 = r_p[i, 0]
        e2 = r_p[i, 1]
        matrix[e1, e2] = i
        matrix[e2, e1] = i
    return matrix
e_matrix_op = edges_matrix_op(ridge_points, ridges_number)
# %%
%%timeit
jit_edges_matrix_op(ridge_points, ridges_number)
# %%
%%timeit
edges_matrix_op(ridge_points, ridges_number)
UPDATE
Indeed, parallelization is not working properly here, so I ran tests with parallel=False. Here are the results:
630 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - parallel=True
553 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - parallel=False
66.5 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) - pure python
UPDATE 2
Thanks to max9111 sharing a link https://github.com/numba/numba/issues/7259
There seems to be an issue with allocating large arrays with zeros (np.zeros)
The issue has been reported a couple of weeks ago, and the link contains some workaround examples.
I tried allocating np.empty()
29.7 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- numba parallel=True
44.7 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- numba parallel=False
60.4 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- pure python
And as you can see, parallelized numba works best, so this task is parallelizable and the overhead is not that big.
I think the crucial issue is related to what is the nature of the calculations you are performing in this part of the code:
for i in range(ridges_number):
    e1 = r_p[i, 0]
    e2 = r_p[i, 1]
    matrix[e1, e2] = i
    matrix[e2, e1] = i
Loops perform best if they are cache-local and trivially parallelizable (i.e. the calculations in each iteration are independent).
In your case, both conditions are violated. e1 and e2 do not take consecutive values across iterations, so the writes to the matrix jump around in memory. Similarly, the matrix r_p is likely preventing efficient parallelization because it needs to be accessed by all of the threads in each iteration (and is probably locked by one thread while the others wait to access it).
All in all, the function you chose to speed up may suffer the overhead of parallelization while in effect the calculations are executed sequentially. And the calculations, at least as they are at the moment, are inherently difficult to speed up by parallelization.
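As for a better way to build the lookup, one option that avoids the explicit loop entirely is NumPy fancy-index assignment. The following is a minimal sketch, not a benchmarked replacement: it assumes no pair of points occurs twice (NumPy does not guarantee which value survives when the same element is assigned more than once), and the n_points parameter and the tiny ridge_points array are made up for illustration.

```python
import numpy as np

def edges_matrix_loop(r_p, n_points):
    """Reference version: explicit Python loop over all ridges."""
    matrix = np.zeros((n_points, n_points), dtype=np.int32)
    for i in range(r_p.shape[0]):
        e1, e2 = r_p[i]
        matrix[e1, e2] = i
        matrix[e2, e1] = i
    return matrix

def edges_matrix_vectorized(r_p, n_points):
    """Same result via two fancy-index assignments (assumes unique pairs)."""
    matrix = np.zeros((n_points, n_points), dtype=np.int32)
    idx = np.arange(r_p.shape[0], dtype=np.int32)
    matrix[r_p[:, 0], r_p[:, 1]] = idx
    matrix[r_p[:, 1], r_p[:, 0]] = idx
    return matrix

# Hypothetical toy data: 4 points, 4 distinct edges.
ridge_points = np.array([[0, 1], [1, 2], [2, 3], [0, 3]], dtype=np.int32)

loop_result = edges_matrix_loop(ridge_points, 4)
vec_result = edges_matrix_vectorized(ridge_points, 4)
print(np.array_equal(loop_result, vec_result))  # True
print(vec_result[3, 0])                         # 3
```

Note that index 0 is indistinguishable from "no edge" in this layout, exactly as in the original function.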

Numpy - most efficient way to create an array from a list of arrays

I have a program whose current performance bottleneck involves creating a new array from a relatively short list of relatively long, flat arrays:
num_arrays = 5
array_length = 1000
arrays = [np.random.random((array_length, )) for _ in range(num_arrays)]
new_array = np.array(arrays)
In other words, stacking n arrays of shape (s,) into new_array of shape (n, s).
I am looking for the most efficient way to compute this, since this operation is repeated millions of times.
I tested for performance of the two trivial ways to do this:
%timeit np.array(arrays)
>>> 3.6 µs ± 67.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.stack(arrays)
>>> 9.61 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I am currently using np.array(arrays), but I am wondering if there is a more efficient way to do this.
Some details which might help:
The length of the arrays is fixed throughout the runtime of the program, e.g. 1000.
The number of arrays is usually low, usually <=5. It is possible to get the upper bound for this at checkpoints throughout the run of the program (i.e. every ~1000 creation of such arrays), but not in advance.
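One option that might be worth timing against np.array(arrays) is to allocate the output once and copy rows into it, since the array length is fixed and an upper bound on the count is known at checkpoints. This is only a sketch under those assumptions; max_arrays, buffer and stack_into are hypothetical names, and no speedup is claimed here.

```python
import numpy as np

array_length = 1000
max_arrays = 5  # assumed upper bound, refreshed at each checkpoint
rng = np.random.default_rng(0)
arrays = [rng.random(array_length) for _ in range(max_arrays)]

# Allocate once, outside the hot loop.
buffer = np.empty((max_arrays, array_length))

def stack_into(buffer, arrays):
    """Copy each 1D array into a row of buffer and return a view of the used rows."""
    for i, a in enumerate(arrays):
        buffer[i] = a
    return buffer[:len(arrays)]

new_array = stack_into(buffer, arrays)
print(new_array.shape)  # (5, 1000)
```

The returned array is a view of buffer, so its contents are overwritten by the next call. That is only acceptable if each result is consumed before the next one is built.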

Is there a list of big O complexities for the numpy library?

I'm doing a time complexity analysis of an algorithm and need to know what kind of complexities certain numpy operations have.
For some, I assume they match the underlying mathematical operation. Like np.dot(array1, array2) would be O(n). For others, I am not as sure. For example, is np.array(my_array) O(1)? or is it O(n)? Does it simply reassign a pointer or is it iterating over the list and copying out each value?
I want to be sure of each operation's complexity. Is there somewhere I can find this information? Or should I just assume they match the mathematical operation?
BigO complexity is not often used with Python and numpy. It's a measure of how the code scales with problem size. That's useful in a compiled language like C. But here the code is a mix of interpreted Python and compiled code. Both can have the same bigO, but the interpreted version will be orders of magnitude slower. That's why most of the SO questions about improving numpy speed, talk about 'removing loops' and 'vectorizing'.
Also few operations are pure O(n); most are a mix. There's a setup cost, plus a per element cost. If the per element cost is small, the setup cost dominates.
If starting with lists, it's often faster to iterate on the list, because converting a list to an array has a substantial overhead (O(n)).
If you already have arrays, then avoid (python level) iteration where possible. Iteration is part of most calculations, but numpy lets you do a lot of that in faster compiled code (faster O(n)).
At some point you have to understand how numpy stores its arrays. The distinction between view and copy is important. A view is in effect O(1), a copy O(n).
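The view/copy distinction can be checked with np.shares_memory. A small sketch with illustrative variable names:

```python
import numpy as np

arr = np.arange(12)

view = arr[2:8]               # basic slicing returns a view
reshaped = arr.reshape(3, 4)  # reshaping a contiguous array returns a view
copied = arr[2:8].copy()      # an explicit copy owns new memory

print(np.shares_memory(arr, view))      # True
print(np.shares_memory(arr, reshaped))  # True
print(np.shares_memory(arr, copied))    # False

# Writing through a view changes the original; the copy is unaffected.
view[0] = -1
print(arr[2])     # -1
print(copied[0])  # 2
```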
Often you'll see SO answers do timeit speed comparisons. I often add the caution that results might vary with problem size. The better answers will time various size problems, and show the results on a nice plot. The results are often a mix of straight lines (O(n)), and curves (varying blends of O(1) and O(n) components).
You asked specifically about np.array. Here are some sample timings:
In [134]: %%timeit alist = list(range(1000))
...: np.array(alist)
67.9 µs ± 839 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [135]: %%timeit alist = list(range(10))
...: np.array(alist)
2.19 µs ± 9.88 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [136]: %%timeit alist = list(range(2000))
...: np.array(alist)
134 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
copy an array:
In [137]: %%timeit alist = list(range(2000)); arr=np.array(alist)
...: np.array(arr)
1.77 µs ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
no copy:
In [138]: %%timeit alist = list(range(2000)); arr=np.array(alist)
...: np.array(arr, copy=False)
237 ns ± 1.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
from a list of strings:
In [139]: %%timeit alist = [str(i) for i in range(2000)]
...: np.array(alist, dtype=int)
286 µs ± 4.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Almost all calculations in numpy are O(n). If an operation involves each element of an array, its speed will depend on the size of the array. Some array manipulations are O(1), such as reshaping, because they don't actually do anything with the data; they change properties like shape and strides.
Search problems often grow faster than O(n); usually numpy is not the best choice for that kind of problem. Smart use of Python lists and dictionaries can be faster.
For the specific example np.array(my_array) as it needs to run through all the elements of my_array, allocate memory and initialize the values, it takes place in linear time.
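Whether a copy happens can be checked directly. In this sketch, np.array copies its input by default, while np.asarray hands back the same object when the input is already an array of the requested type:

```python
import numpy as np

my_array = np.arange(2000)

copied = np.array(my_array)   # copies by default, so the cost grows with the size
same = np.asarray(my_array)   # no conversion needed, so the input is returned as-is

print(np.shares_memory(my_array, copied))  # False
print(same is my_array)                    # True
```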
There is a python module big_O that can be used to analyze the complexity of a function from its execution time.
Refer to this link for more information

Why doesn't numpy.zeros allocate all of its memory on creation? And how can I force it to?

I want to create an empty Numpy array in Python, to later fill it with values. The code below generates a 1024x1024x1024 array with 2-byte integers, which means it should take at least 2GB in RAM.
>>> import numpy as np; from sys import getsizeof
>>> A = np.zeros((1024,1024,1024), dtype=np.int16)
>>> getsizeof(A)
2147483776
From getsizeof(A), we see that the array takes 2^31 + 128 bytes (presumably of header information.) However, using my task manager, I can see Python is only taking 18.7 MiB of memory.
Assuming the array is compressed, I assigned random values to each memory slot so that it could not be.
>>> for i in range(1024):
...     for j in range(1024):
...         for k in range(1024):
...             A[i,j,k] = np.random.randint(32767, dtype = np.int16)
The loop is still running, and my RAM usage is slowly increasing (presumably as the arrays composing A inflate with the incompressible noise.) I'm assuming it would make my code faster to force numpy to expand this array from the beginning. Curiously, I haven't seen this documented anywhere!
So, 1. Why does numpy do this? and 2. How can I force numpy to allocate memory?
A neat answer to your first question can also be found in this StackOverflow answer.
To answer your second question, you can force the memory to be allocated as follows in a more or less efficient manner:
A = np.empty((1024,1024,1024), dtype=np.int16)
A.fill(0)
because then the memory is touched.
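The same pattern on a much smaller array, so that it runs quickly, together with the size arithmetic from the question. This sketch only shows the calls and the byte counts; observing the actual resident memory requires an external tool such as the task manager.

```python
import numpy as np

# Size arithmetic from the question: 1024**3 elements of 2 bytes each.
expected_bytes = 1024**3 * np.dtype(np.int16).itemsize
print(expected_bytes == 2**31)  # True

# Same pattern on a small array so the sketch runs quickly.
A = np.empty((64, 64, 64), dtype=np.int16)
A.fill(0)        # writes every element, so every page of the buffer is touched
print(A.nbytes)  # 524288
```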
On my machine with my setup,
A = np.empty(0)
A.resize((1024, 1024, 1024))
also does the trick, but I cannot find this behavior documented, and this might be an implementation detail; realloc is used under the hood in numpy.
Let's look at some timings for a smaller case:
In [107]: A = np.zeros(10000,int)
In [108]: for i in range(A.shape[0]): A[i]=np.random.randint(327676)
We don't need to make A 3d to get the same effect; 1d of the same total size would be just as good.
In [109]: timeit for i in range(A.shape[0]): A[i]=np.random.randint(327676)
37 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now compare that time to the alternative of generating the random numbers with one call:
In [110]: timeit np.random.randint(327676, size=A.shape)
185 µs ± 905 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Much much faster.
If we do the same loop, but simply assign the random number to a variable (and throw it away):
In [111]: timeit for i in range(A.shape[0]): x=np.random.randint(327676)
32.3 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The times are nearly the same as the original case. Assigning the values to the zeros array is not the big time consumer.
I'm not testing a very large case as you are, and my A has already been initialized in full. So you are welcome to repeat the comparisons with your size. But I think the pattern will still hold: iterating 1024x1024x1024 times (about 100,000 times more than in my example) is the big time consumer, not the memory allocation task.
Something else you might experiment with: just iterate on the first dimension of A, and assign a randint result shaped like the other 2 dimensions. For example, expanding my A with a size 10 dimension:
In [112]: A = np.zeros((10,10000),int)
In [113]: timeit for i in range(A.shape[0]): A[i]=np.random.randint(327676,size=A.shape[1])
1.95 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A is 10x larger than in [107], but takes 16x less time to fill, because it only has to iterate 10x. In numpy if you must iterate, try to do it a few times on a more complex task.
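Taken one step further, the Python-level loop can be dropped completely by generating all values in one call. A minimal sketch using the newer Generator API (np.random.default_rng) rather than the np.random.randint calls used in the timings here:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.zeros((10, 10000), dtype=np.int64)

# One call generates all values; the slice assignment fills A in place.
A[...] = rng.integers(0, 327676, size=A.shape)

print(A.shape)  # (10, 10000)
```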
(timeit repeats the test many times (e.g. 7*10), so it isn't going to capture any initial memory allocation step, even if I use a large enough array for that to matter).

Python's sum vs. NumPy's numpy.sum

What are the differences in performance and behavior between using Python's native sum function and NumPy's numpy.sum? sum works on NumPy's arrays and numpy.sum works on Python lists and they both return the same effective result (haven't tested edge cases such as overflow) but different types.
>>> import numpy as np
>>> np_a = np.array(range(5))
>>> np_a
array([0, 1, 2, 3, 4])
>>> type(np_a)
<class 'numpy.ndarray'>
>>> py_a = list(range(5))
>>> py_a
[0, 1, 2, 3, 4]
>>> type(py_a)
<class 'list'>
# The numerical answer (10) is the same for the following sums:
>>> type(np.sum(np_a))
<class 'numpy.int32'>
>>> type(sum(np_a))
<class 'numpy.int32'>
>>> type(np.sum(py_a))
<class 'numpy.int32'>
>>> type(sum(py_a))
<class 'int'>
Edit: I think my practical question here is would using numpy.sum on a list of Python integers be any faster than using Python's own sum?
Additionally, what are the implications (including performance) of using a Python integer versus a scalar numpy.int32? For example, for a += 1, is there a behavior or performance difference if the type of a is a Python integer or a numpy.int32? I am curious if it is faster to use a NumPy scalar datatype such as numpy.int32 for a value that is added or subtracted a lot in Python code.
For clarification, I am working on a bioinformatics simulation which partly consists of collapsing multidimensional numpy.ndarrays into single scalar sums which are then additionally processed. I am using Python 3.2 and NumPy 1.6.
I got curious and timed it. numpy.sum seems much faster for numpy arrays, but much slower on lists.
import numpy as np
import timeit
x = range(1000)
# or
#x = np.random.standard_normal(1000)
def pure_sum():
    return sum(x)
def numpy_sum():
    return np.sum(x)
n = 10000
t1 = timeit.timeit(pure_sum, number = n)
print 'Pure Python Sum:', t1
t2 = timeit.timeit(numpy_sum, number = n)
print 'Numpy Sum:', t2
Result when x = range(1000):
Pure Python Sum: 0.445913167735
Numpy Sum: 8.54926219673
Result when x = np.random.standard_normal(1000):
Pure Python Sum: 12.1442425643
Numpy Sum: 0.303303771848
I am using Python 2.7.2 and Numpy 1.6.1
[...] my [...] question here is would using numpy.sum on a list of Python integers be any faster than using Python's own sum?
The answer to this question is: No.
Python's sum will be faster on lists, while NumPy's sum will be faster on arrays. I actually did a benchmark to show the timings (Python 3.6, NumPy 1.14):
import random
import numpy as np
import matplotlib.pyplot as plt
from simple_benchmark import benchmark
%matplotlib notebook
def numpy_sum(it):
    return np.sum(it)
def python_sum(it):
    return sum(it)
def numpy_sum_method(arr):
    return arr.sum()
b_array = benchmark(
[numpy_sum, numpy_sum_method, python_sum],
arguments={2**i: np.random.randint(0, 10, 2**i) for i in range(2, 21)},
argument_name='array size',
function_aliases={numpy_sum: 'numpy.sum(<array>)', numpy_sum_method: '<array>.sum()', python_sum: "sum(<array>)"}
)
b_list = benchmark(
[numpy_sum, python_sum],
arguments={2**i: [random.randint(0, 10) for _ in range(2**i)] for i in range(2, 21)},
argument_name='list size',
function_aliases={numpy_sum: 'numpy.sum(<list>)', python_sum: "sum(<list>)"}
)
With these results:
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
b_array.plot(ax=ax1)
b_list.plot(ax=ax2)
Left: on a NumPy array; Right: on a Python list.
Note that this is a log-log plot because the benchmark covers a very wide range of values. However for qualitative results: Lower means better.
This shows that for lists Python's sum is always faster, while for arrays np.sum or the sum method is faster (except for very short arrays, where Python's sum wins).
Just in case you're interested in comparing these against each other I also made a plot including all of them:
f, ax = plt.subplots(1)
b_array.plot(ax=ax)
b_list.plot(ax=ax)
ax.grid(which='both')
Interestingly, the point at which numpy on arrays can compete with Python on lists is at roughly 200 elements! Note that this number may depend on a lot of factors, such as Python/NumPy version, ... Don't take it too literally.
What hasn't been mentioned is the reason for this difference (I mean the large scale difference, not the difference for short lists/arrays where the functions simply have different constant overhead). Assuming CPython, a Python list is a wrapper around a C (the language C) array of pointers to Python objects (in this case Python integers). These integers can be seen as wrappers around a C integer (not actually correct because Python integers can be arbitrarily big so it cannot simply use one C integer but it's close enough).
For example a list like [1, 2, 3] would be (schematically, I left out a few details) stored like this:
A NumPy array however is a wrapper around a C array containing C values (in this case int or long depending on 32 or 64bit and depending on the operating system).
So a NumPy array like np.array([1, 2, 3]) would look like this:
The next thing to understand is how these functions work:
Python's sum iterates over the iterable (in this case the list or array) and adds all elements.
NumPy's sum method iterates over the stored C array, adds these C values, and finally wraps the result in a Python type (in this case numpy.int32 or numpy.int64) and returns it.
NumPy's sum function converts the input to an array (at least if it isn't an array already) and then uses the NumPy sum method.
Clearly adding C values from a C array is much faster than adding Python objects, which is why the NumPy functions can be much faster (see the second plot above, the NumPy functions on arrays beat the Python sum by far for large arrays).
But converting a Python list to a NumPy array is relatively slow and then you still have to add the C values. Which is why for lists the Python sum will be faster.
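The four combinations discussed here can be timed with nothing but the standard timeit module. This is a rough sketch for a single size (1000 elements, chosen arbitrarily); the numbers it prints depend on the machine and the Python/NumPy versions, so none are quoted.

```python
import timeit
import numpy as np

values = list(range(1000))
arr = np.array(values)

candidates = {
    "sum(list)": lambda: sum(values),
    "np.sum(list)": lambda: np.sum(values),
    "sum(array)": lambda: sum(arr),
    "array.sum()": lambda: arr.sum(),
}

# All four compute the same total; only the time differs.
results = {name: func() for name, func in candidates.items()}

for name, func in candidates.items():
    seconds = timeit.timeit(func, number=1000)
    print(f"{name:>14}: {seconds:.4f} s for 1000 calls")
```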
The only remaining open question is why Python's sum on an array is so slow (it's the slowest of all compared functions). That has to do with the fact that Python's sum simply iterates over whatever you pass in. In the case of a list it gets the stored Python objects, but in the case of a 1D NumPy array there are no stored Python objects, just C values, so Python and NumPy have to create a Python object (a numpy.int32 or numpy.int64) for each element, and then these Python objects have to be added. Creating the wrapper for each C value is what makes it really slow.
Additionally, what are the implications (including performance) of using a Python integer versus a scalar numpy.int32? For example, for a += 1, is there a behavior or performance difference if the type of a is a Python integer or a numpy.int32?
I made some tests and for addition and subtractions of scalars you should definitely stick with Python integers. Even though there could be some caching going on which means that the following tests might not be totally representative:
from itertools import repeat
python_integer = 1000
numpy_integer_32 = np.int32(1000)
numpy_integer_64 = np.int64(1000)
def repeatedly_add_one(val):
    for _ in repeat(None, 100000):
        _ = val + 1
%timeit repeatedly_add_one(python_integer)
3.7 ms ± 71.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_add_one(numpy_integer_32)
14.3 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_add_one(numpy_integer_64)
18.5 ms ± 494 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
def repeatedly_sub_one(val):
    for _ in repeat(None, 100000):
        _ = val - 1
%timeit repeatedly_sub_one(python_integer)
3.75 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_sub_one(numpy_integer_32)
15.7 ms ± 437 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_sub_one(numpy_integer_64)
19 ms ± 834 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It's 3-6 times faster to do scalar operations with Python integers than with NumPy scalars. I haven't checked why that's the case but my guess is that NumPy scalars are rarely used and probably not optimized for performance.
The difference becomes a bit less if you actually perform arithmetic operations where both operands are numpy scalars:
def repeatedly_add_one(val):
    one = type(val)(1)  # create a 1 with the same type as the input
    for _ in repeat(None, 100000):
        _ = val + one
%timeit repeatedly_add_one(python_integer)
3.88 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_add_one(numpy_integer_32)
6.12 ms ± 324 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit repeatedly_add_one(numpy_integer_64)
6.49 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Then it's only 2 times slower.
In case you wondered why I used itertools.repeat here when I could simply have used for _ in range(...) instead: repeat is faster and thus incurs less overhead per loop. Because I'm only interested in the addition/subtraction time, it's preferable not to have the looping overhead messing with the timings (at least not that much).
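On the behavior side of the question, one difference is easy to demonstrate: Python integers grow as needed, while NumPy integers have a fixed width. The sketch below forces a 32-bit accumulator with dtype=np.int32 so the result does not depend on the platform's default integer size.

```python
import numpy as np

big = 2**31 - 1  # largest value a 32-bit signed integer can hold

# Python integers have arbitrary precision, so the total is exact.
python_total = sum([big, 1])

# A fixed-width accumulator wraps around (silently, for array reductions).
np_values = np.array([big, 1], dtype=np.int32)
numpy_total = np.sum(np_values, dtype=np.int32)

print(python_total)      # 2147483648
print(int(numpy_total))  # -2147483648
```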
Note that Python sum on multidimensional numpy arrays will only perform a sum along the first axis:
sum(np.array([[[2,3,4],[4,5,6]],[[7,8,9],[10,11,12]]]))
Out[47]:
array([[ 9, 11, 13],
       [14, 16, 18]])
np.sum(np.array([[[2,3,4],[4,5,6]],[[7,8,9],[10,11,12]]]), axis=0)
Out[48]:
array([[ 9, 11, 13],
       [14, 16, 18]])
np.sum(np.array([[[2,3,4],[4,5,6]],[[7,8,9],[10,11,12]]]))
Out[49]: 81
Numpy should be much faster, especially when your data is already a numpy array.
Numpy arrays are a thin layer over a standard C array. When numpy sum iterates over this, it isn't doing type checking and it is very fast. The speed should be comparable to doing the operation using standard C.
In comparison, Python's sum iterates over the NumPy array element by element, and each element first has to be wrapped in a Python object. It also has to do some type checking and is generally going to be slower.
The exact amount that python sum is slower than numpy sum is not well defined as the python sum is going to be a somewhat optimized function as compared to writing your own sum function in python.
This is an extension to the answer above by Akavall. From that answer you can see that np.sum performs faster for np.array objects, whereas sum performs faster for list objects. To expand upon that:
On running np.sum for an np.array object vs. sum for a list object, it seems that they perform neck and neck.
# I'm running IPython
In [1]: x = range(1000) # list object
In [2]: y = np.array(x) # np.array object
In [3]: %timeit sum(x)
100000 loops, best of 3: 14.1 µs per loop
In [4]: %timeit np.sum(y)
100000 loops, best of 3: 14.3 µs per loop
Above, sum is a tiny bit faster than np.sum, although at times I've seen np.sum timings of 14.1 µs, too. But mostly it's 14.3 µs.
If you use sum() on a 2D array, it only sums along the first axis:
a = np.arange(6).reshape(2, 3)
print(a)
print(sum(a))
print(sum(sum(a)))
print(np.sum(a))
>>>
[[0 1 2]
 [3 4 5]]
[3 5 7]
15
15
