From timing the creation of Nx4096x4096 arrays, it appears Numpy does it much faster when N = 2 or 3 than N = 1:
import numpy as np
%timeit a = np.zeros((2, 4096, 4096), dtype=np.float32, order='C')
5.24 µs ± 98.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a = np.zeros((4096, 4096), dtype=np.float32, order='C')
23.4 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The difference is shocking. Why is that, and how can I make the N = 1 case at least as fast as the N > 1 case? Could %timeit simply be the wrong tool for timing this?
Context: I need to create another single array of 4096 x 4096 with a different type (uint8), and I'm trying to get the fastest Pythonic (or NumPy-related) implementation. The Nx4096x4096 array will be populated with non-zero values from a 3-column array (read from a file), where the 1st column holds 1D coordinates and the 2nd and 3rd columns hold the intensity values for the 1st and 2nd image (hence the N=2 case). Using a sparse matrix is not an option for now.
There are 130 million such files, so the above happens that many times.
[EDIT] This is under Python 3.6.4 and numpy 1.14 on macOS Sierra. The same versions under Windows do not reproduce this behavior: there, np.zeros() for the smaller array takes half the time of the twice-larger array. From the comments and the mentioned duplicate question I understand this can be due to thresholds in memory allocation. This does however defeat the purpose of %timeit.
[EDIT 2] Regarding the duplicate question, the question here should be now more about how to time this function properly, without having to write extra code that will access the variable so the OS actually allocates the memory. Wouldn't that extra code bias the result of the timing? Isn't there a simple way to profile this?
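A minimal sketch of one way to time allocation together with the first write, assuming the very fast timing comes from memory that the OS has not actually backed yet. The helper name time_zeros and its parameters are hypothetical, and the fill call adds its own cost, so the numbers are only comparable between runs that use the same touch setting.

```python
import time
import numpy as np

def time_zeros(shape, dtype=np.float32, touch=True, repeats=5):
    """Return the best wall time (seconds) to create an array of zeros.

    With touch=True every element is written once, which (by assumption)
    forces the OS to actually provide the memory inside the timed region.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a = np.zeros(shape, dtype=dtype, order='C')
        if touch:
            a.fill(0)  # write to every element of the new array
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
        del a  # release the array before the next repeat
    return best

# Small stand-in shapes; replace with (4096, 4096) and (2, 4096, 4096).
t_one = time_zeros((256, 256))
t_two = time_zeros((2, 256, 256))
```

Comparing touch=True against touch=False for the same shape gives a rough idea of how much of the cost is deferred until the first write.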
I have recently discovered that Numba may run much slower than pure Python, even in nopython mode with the parallel=True option enabled.
Important: If you don't deal with Voronoi diagrams please continue reading, my question doesn't relate to them directly.
Currently, I am working on a problem where I have energy associated with the Voronoi diagram's edges and cell areas. The scipy Voronoi returns an array containing pairs of points (vor.ridge_points) associated with each Voronoi edge. For my code, I want to be able to get the index of an edge from the indexes of its associated points, so I define a kind of adjacency matrix which, instead of ones and zeros, holds zeros and edge indexes.
It turns out that pure Python looping over numpy arrays is about 10 times faster than numba. Here is a toy example (I just randomly generated arrays with the same number of edges and points as in my simulation).
My guess is that it has something to do with memory allocation. Any take on the subject would be appreciated (the reason why it is so much slower, or a better way to get the edge number from the point numbers) :)
# %%
from numba.np.ufunc import parallel
import numpy as np
from numba import njit
from numba import prange
# %% generating array that models array of ridges
points_number = 8802
ridges_number = 26379
np.random.seed(123)
ridge_points = np.random.randint(points_number, size=(ridges_number, 2))
# %% symmetric matrix containing indexes of all edges
# in space [original_point_1, original_point_2]
ridge_points = np.array(ridge_points, dtype=np.int32)
@njit(parallel=True, cache=True)
def jit_edges_matrix_op(r_p, r_n):
    matrix = np.zeros((r_n, r_n), dtype=np.int32)
    for i in prange(r_n):
        e1 = r_p[i, 0]
        e2 = r_p[i, 1]
        matrix[e1, e2] = i
        matrix[e2, e1] = i
    return matrix
e_matrix_op = jit_edges_matrix_op(ridge_points, ridges_number)
# %% the same but not jitted
def edges_matrix_op(r_p, r_n):
    matrix = np.zeros((r_n, r_n), dtype=np.int32)
    for i in range(r_n):
        e1 = r_p[i, 0]
        e2 = r_p[i, 1]
        matrix[e1, e2] = i
        matrix[e2, e1] = i
    return matrix
e_matrix_op = edges_matrix_op(ridge_points, ridges_number)
# %%
%%timeit
jit_edges_matrix_op(ridge_points, ridges_number)
# %%
%%timeit
edges_matrix_op(ridge_points, ridges_number)
UPDATE
Indeed, parallelization is not working properly here, so I ran tests with parallel=False. Here are the results:
630 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - parallel=True
553 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - parallel=False
66.5 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) - pure python
UPDATE 2
Thanks to max9111 for sharing this link: https://github.com/numba/numba/issues/7259
There seems to be an issue with allocating large arrays of zeros (np.zeros).
The issue was reported a couple of weeks ago, and the link contains some workaround examples.
I tried allocating with np.empty() instead:
29.7 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- numba parallel=True
44.7 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- numba parallel=False
60.4 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- pure python
As you can see, parallelized numba works best, so this task is parallelizable and the overhead is not that big.
I think the crucial issue is related to what is the nature of the calculations you are performing in this part of the code:
for i in range(ridges_number):
    e1 = r_p[i, 0]
    e2 = r_p[i, 1]
    matrix[e1, e2] = i
    matrix[e2, e1] = i
Loop calculations perform best if they are cache-local and trivially parallelizable (i.e. the calculations in each iteration are independent).
In your case, both conditions are violated. e1 and e2 do not take consecutive values across loop iterations. Similarly, the matrix r_p is likely preventing efficient parallelization, because it needs to be accessed by all of the threads in each iteration and is probably locked by one thread while the others try to access it.
All in all, the function you chose to speed up may suffer the overhead of parallelization while, in effect, the calculations are executed sequentially. And the calculations, at least as they stand, are inherently difficult to speed up by parallelization.
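As for a different way to get an edge number from a pair of point numbers, here is a minimal sketch that swaps the dense matrix for a plain dictionary keyed by the sorted point pair. This is not the approach from the question; the names build_edge_lookup and edge_index are hypothetical, and whether it is fast enough would need to be measured on the real data.

```python
import numpy as np

def build_edge_lookup(ridge_points):
    """Map each unordered point pair to the index of its edge.

    If a pair occurs more than once, the last edge index wins,
    which matches the sequential loop in the question.
    """
    lookup = {}
    for i, (p, q) in enumerate(ridge_points.tolist()):
        key = (p, q) if p <= q else (q, p)  # order-independent key
        lookup[key] = i
    return lookup

def edge_index(lookup, p, q):
    """Return the edge index for points p and q (KeyError if no such edge)."""
    key = (p, q) if p <= q else (q, p)
    return lookup[key]

# Tiny stand-in for vor.ridge_points: three edges of a triangle.
ridge_points = np.array([[0, 1], [1, 2], [2, 0]], dtype=np.int32)
lookup = build_edge_lookup(ridge_points)
```

A lookup such as edge_index(lookup, 1, 0) then returns the same index regardless of the order in which the two points are given, and memory grows with the number of edges rather than with the square of the matrix size.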
I have a program whose current performance bottleneck involves creating a new array from a relatively short list of relatively long, flat arrays:
num_arrays = 5
array_length = 1000
arrays = [np.random.random((array_length, )) for _ in range(num_arrays)]
new_array = np.array(arrays)
In other words, stacking n arrays of shape (s,) into new_array of shape (n, s).
I am looking for the most efficient way to compute this, since this operation is repeated millions of times.
I tested for performance of the two trivial ways to do this:
%timeit np.array(arrays)
>>> 3.6 µs ± 67.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.stack(arrays)
>>> 9.61 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I am currently using np.array(arrays), but I am wondering if there is a more efficient way to do this.
Some details which might help:
The length of the arrays is fixed throughout the runtime of the program, e.g. 1000.
The number of arrays is usually low, usually <=5. It is possible to get the upper bound for this at checkpoints throughout the run of the program (i.e. every ~1000 creation of such arrays), but not in advance.
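One option that could be measured against np.array(arrays) is copying into a buffer that is allocated once and reused. This is a sketch only: max_arrays stands for the upper bound obtained at a checkpoint, stack_into is a hypothetical helper, and no speedup is claimed without timing it on the real workload.

```python
import numpy as np

num_arrays = 5
array_length = 1000
max_arrays = 8  # hypothetical upper bound known at the last checkpoint

# Allocated once, reused for every stacking operation.
buffer = np.empty((max_arrays, array_length))

def stack_into(buffer, arrays):
    """Copy each 1D array into a row of buffer and return a view of the used rows."""
    for i, arr in enumerate(arrays):
        buffer[i] = arr
    return buffer[:len(arrays)]

arrays = [np.random.random((array_length,)) for _ in range(num_arrays)]
new_array = stack_into(buffer, arrays)
```

The returned array is a view into the shared buffer, so its contents are overwritten by the next call; take a copy if the result has to outlive that call.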
I want to do the element-wise outer product of three (or four) large 2D arrays in python (values are float32 rounded to 2 decimals). They all have the same number of rows "n", but different number of columns "i", "j", "k".
The resulting array should be of shape (n, i*j*k). Then, I want to sum each column of the result to end up with a 1D array of shape (i*j*k).
np.shape(a) = (75466, 10)
np.shape(b) = (75466, 28)
np.shape(c) = (75466, 66)
np.shape(intermediate_result) = (75466, 18480)
np.shape(result) = (18480,)
Thanks to ruankesi and divakar, I got a piece of code that works:
# Multiply first two matrices
first_multi = a[...,None] * b[:,None]
# could use np.einsum('ij,ik->ijk',a,b), which is slightly faster
ab_fills = first_multi.reshape(a.shape[0], a.shape[1]*b.shape[1])
# Multiply the result with the third matrix
second_multi = ab_fills[..., None] * c[:,None]
abc_fills = second_multi.reshape(ab_fills.shape[0], ab_fills.shape[1] * c.shape[1])
# Get the result: sum columns and get a 1D array of length 10*28*66 = 18 480
result = np.sum(abc_fills, axis = 0)
Problem 1: Performance
This takes about 3 seconds, but I have to repeat this operation many times and some of the matrices are even larger (in number of rows). It is acceptable but making it faster would be nice.
Problem 2: My matrices are sparse
Indeed, for instance, "a" contains 70% zeros. I tried to play with scipy csc_matrix, but really could not get a working version (to get the element-wise outer product here I go through a 3D array, which scipy sparse matrices do not support).
Problem 3: memory usage
If I try to also work with a 4th matrix, I run into memory issues.
I imagine that converting this code to sparse_matrix would save a lot of memory, and make the calculation faster by ignoring the numerous 0 values.
Is that true? If yes, can someone help me?
Of course, if you have any suggestion for a better implementation, I am also very interested. I don't need any of the intermediate results, just the final 1D result.
I've been stuck on this part of the code for weeks, and I am going nuts!
Thank you!
Edit after Divakar's answer
Approach #1:
Very nice one liner but surprisingly slower than the original approach (?).
On my test dataset, approach #1 takes 4.98 s ± 3.06 ms per loop (no speedup with optimize = True)
The original decomposed approach took 3.01 s ± 16.5 ms per loop
Approach #2:
Absolutely great, thank you! What an impressive speedup!
62.6 ms ± 233 µs per loop
About numexpr, I try to avoid as much as possible requirements for external modules, and I don't plan to use multicores/threads. This is an "embarrassingly" parallelizable task, with hundreds of thousands of objects to analyze, I'll just spread the list across available CPUs during production. I will give it a try for memory optimization.
As a brief try of numexpr with a restriction for 1 thread, performing 1 multiplication, I get a runtime of 40ms without numexpr, and 52 ms with numexpr.
Thanks again!!
Approach #1
We can use np.einsum to do sum-reductions in one go -
result = np.einsum('ij,ik,il->jkl',a,b,c).ravel()
Also, play around with the optimize flag in np.einsum by setting it as True to use BLAS.
Approach #2
We can use broadcasting to do the first step, as in the posted code, and then leverage tensor-matrix multiplication with np.tensordot -
def broadcast_dot(a,b,c):
    first_multi = a[...,None] * b[:,None]
    return np.tensordot(first_multi,c, axes=(0,0)).ravel()
We can also use the numexpr module, which supports multi-core processing and achieves better memory efficiency, to get first_multi. This gives us a modified solution, like so -
import numexpr as ne
def numexpr_broadcast_dot(a,b,c):
    first_multi = ne.evaluate('A*B',{'A':a[...,None],'B':b[:,None]})
    return np.tensordot(first_multi,c, axes=(0,0)).ravel()
Timings on random float data with given dataset sizes -
In [36]: %timeit np.einsum('ij,ik,il->jkl',a,b,c).ravel()
4.57 s ± 75.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit broadcast_dot(a,b,c)
270 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit numexpr_broadcast_dot(a,b,c)
172 ms ± 63.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Just to give a sense of improvement with numexpr -
In [7]: %timeit a[...,None] * b[:,None]
80.4 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit ne.evaluate('A*B',{'A':a[...,None],'B':b[:,None]})
25.9 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This should be substantial when extending this solution to a higher number of inputs.
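A quick sanity check that the approaches agree can be run on small random inputs. The sketch below restates the decomposed approach from the question as original_approach (a name introduced here) and uses much smaller shapes than the real data so it runs instantly.

```python
import numpy as np

def original_approach(a, b, c):
    """Decomposed version from the question: two broadcasts, then a column sum."""
    ab = (a[..., None] * b[:, None]).reshape(a.shape[0], -1)
    abc = (ab[..., None] * c[:, None]).reshape(a.shape[0], -1)
    return abc.sum(axis=0)

def broadcast_dot(a, b, c):
    """Approach #2: broadcast a and b, then contract the row axis with c."""
    first_multi = a[..., None] * b[:, None]
    return np.tensordot(first_multi, c, axes=(0, 0)).ravel()

rng = np.random.default_rng(0)
a = rng.random((50, 3))
b = rng.random((50, 4))
c = rng.random((50, 5))

expected = original_approach(a, b, c)
via_tensordot = broadcast_dot(a, b, c)
via_einsum = np.einsum('ij,ik,il->jkl', a, b, c).ravel()
agree = np.allclose(expected, via_tensordot) and np.allclose(expected, via_einsum)
```

All three results have length i*j*k with the same element order, since each one flattens the (j, k, l) axes in C order.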
I want to create an empty Numpy array in Python, to later fill it with values. The code below generates a 1024x1024x1024 array with 2-byte integers, which means it should take at least 2GB in RAM.
>>> import numpy as np; from sys import getsizeof
>>> A = np.zeros((1024,1024,1024), dtype=np.int16)
>>> getsizeof(A)
2147483776
From getsizeof(A), we see that the array takes 2^31 + 128 bytes (presumably of header information.) However, using my task manager, I can see Python is only taking 18.7 MiB of memory.
Assuming the array is compressed, I assigned random values to each memory slot so that it could not be.
>>> for i in range(1024):
...     for j in range(1024):
...         for k in range(1024):
...             A[i,j,k] = np.random.randint(32767, dtype = np.int16)
The loop is still running, and my RAM usage is slowly increasing (presumably as the arrays composing A inflate with the incompressible noise). I'm assuming it would make my code faster to force numpy to expand this array from the beginning. Curiously, I haven't seen this documented anywhere!
So, 1. Why does numpy do this? and 2. How can I force numpy to allocate memory?
A neat answer to your first question can also be found in this StackOverflow answer.
To answer your second question, you can force the memory to be allocated as follows in a more or less efficient manner:
A = np.empty((1024,1024,1024), dtype=np.int16)
A.fill(0)
because then the memory is touched.
On my machine with my setup,
A = np.empty(0)
A.resize((1024, 1024, 1024))
also does the trick, but I cannot find this behavior documented, and this might be an implementation detail; realloc is used under the hood in numpy.
Let's look at some timings for a smaller case:
In [107]: A = np.zeros(10000,int)
In [108]: for i in range(A.shape[0]): A[i]=np.random.randint(327676)
We don't need to make A 3d to get the same effect; 1d of the same total size would be just as good.
In [109]: timeit for i in range(A.shape[0]): A[i]=np.random.randint(327676)
37 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now compare that time to the alternative of generating the random numbers with one call:
In [110]: timeit np.random.randint(327676, size=A.shape)
185 µs ± 905 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Much much faster.
If we do the same loop, but simply assign the random number to a variable (and throw it away):
In [111]: timeit for i in range(A.shape[0]): x=np.random.randint(327676)
32.3 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The times are nearly the same as the original case. Assigning the values to the zeros array is not the big time consumer.
I'm not testing a very large case as you are, and my A has already been initialized in full. So you are welcome to repeat the comparisons with your size. But I think the pattern will still hold: iterating 1024x1024x1024 times (about 100,000 times more than in my example) is the big time consumer, not the memory allocation.
Something else you might experiment with: just iterate on the first dimension of A, and assign random integers shaped like the other 2 dimensions. For example, expanding my A with a size 10 dimension:
In [112]: A = np.zeros((10,10000),int)
In [113]: timeit for i in range(A.shape[0]): A[i]=np.random.randint(327676,size=A.shape[1])
1.95 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A is 10x larger than in [107], but takes 16x less time to fill, because the loop only has to iterate 10 times. In numpy, if you must iterate, try to do it a few times on a more complex task.
(timeit repeats the test many times (e.g. 7*10), so it isn't going to capture any initial memory allocation step, even if I use a large enough array for that to matter).
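Applied to the 3D array from the question, the same idea could look like the sketch below: loop over the first axis only and fill one 2D slab per iteration. The shape here is a small stand-in for (1024, 1024, 1024) so that the example runs quickly.

```python
import numpy as np

shape = (8, 16, 16)  # small stand-in for (1024, 1024, 1024)
A = np.zeros(shape, dtype=np.int16)

# One call to randint per slab instead of one call per element.
for i in range(A.shape[0]):
    A[i] = np.random.randint(32767, size=A.shape[1:], dtype=np.int16)
```

With the full-size array this runs the Python-level loop 1024 times rather than 1024x1024x1024 times.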
I'm trying to execute the following
from numpy import *
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]]) # large nparray
for item in x:
    set(item)
and it takes very long compared to:
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]]) # large nparray
for item in x:
    item.tolist()
Why does it take much longer to convert a NumPy array to a set than to a list?
I mean basically both have complexity O(n)?
TL;DR: The set() function creates a set using Python's iteration protocol. But iterating (on the Python level) over NumPy arrays is so slow that using tolist() to convert the array to a Python list before doing the iteration is (much) faster.
To understand why iterating over NumPy arrays is so slow it's important to know how Python objects, Python lists, and NumPy arrays are stored in memory.
A Python object needs some bookkeeping properties (like the reference count, a link to its class, ...) and the value it represents. For example the integer ten = 10 could look like this:
The blue circle is the "name" you use in the Python interpreter for the variable ten and the lower object (instance) is what actually represents the integer (since the bookkeeping properties aren't important here I ignored them in the images).
A Python list is just a collection of Python objects, for example mylist = [1, 2, 3] would be saved like this:
This time the list references the Python integers 1, 2 and 3 and the name mylist just references the list instance.
But an array myarray = np.array([1, 2, 3]) doesn't store Python objects as elements:
The values 1, 2 and 3 are stored directly in the NumPy array instance.
With this information I can explain why iterating over an array is so much slower compared to an iteration over a list:
Each time you access the next element in a list the list just returns a stored object. That's very fast because the element already exists as Python object (it just needs to increment the reference count by one).
On the other hand when you want an element of an array it needs to create a new Python "box" for the value with all the bookkeeping stuff before it is returned. When you iterate over the array it needs to create one Python box for each element in your array:
Creating these boxes is slow and the main reason why iterating over NumPy arrays is much slower than iterating over Python collections (lists/tuples/sets/dictionaries) which store the values and their box:
import numpy as np
arr = np.arange(100000)
lst = list(range(100000))
def iterateover(obj):
    for item in obj:
        pass
%timeit iterateover(arr)
# 20.2 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit iterateover(lst)
# 3.96 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The set "constructor" just does an iteration over the object.
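The boxing described above can be observed directly with identity checks. This is only an illustrative sketch; the variable names are introduced here.

```python
import numpy as np

arr = np.arange(100000)
lst = list(range(100000))

# A list hands back the object it already stores, so both lookups
# return the very same Python object.
list_same = lst[500] is lst[500]

# An array has to build a new NumPy scalar object on every access,
# so the two lookups give two distinct objects with equal values.
arr_same = arr[500] is arr[500]
arr_equal = arr[500] == arr[500]

boxed_type = type(arr[500])  # a NumPy integer scalar type, not the builtin int
```

Here list_same is True and arr_same is False, even though arr_equal is True.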
One thing I can't answer definitely is why the tolist method is so much faster. In the end each value in the resulting Python list needs to be in a "Python box" so there's not much work that tolist could avoid. But one thing I know for sure is that list(array) is slower than array.tolist():
arr = np.arange(100000)
%timeit list(arr)
# 20 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr.tolist()
# 10.3 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Each of these has O(n) runtime complexity but the constant factors are very different.
In your case you compared set() to tolist(), which isn't a particularly good comparison. It would make more sense to compare set(arr) to list(arr) or set(arr.tolist()) to arr.tolist():
arr = np.random.randint(0, 1000, (10000, 3))
def tosets(arr):
    for line in arr:
        set(line)
def tolists(arr):
    for line in arr:
        list(line)
def tolists_method(arr):
    for line in arr:
        line.tolist()
def tosets_intermediatelist(arr):
    for line in arr:
        set(line.tolist())
%timeit tosets(arr)
# 72.2 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists(arr)
# 80.5 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists_method(arr)
# 16.3 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit tosets_intermediatelist(arr)
# 38.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So if you want sets you are better off using set(arr.tolist()). For bigger arrays it could make sense to use np.unique but because your rows only contain 3 items that will likely be slower (for thousands of elements it could be much faster!).
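For reference, a minimal sketch of the np.unique alternative on a longer row. It returns a sorted array of the distinct values rather than a Python set; which variant is faster for a given size would need to be timed.

```python
import numpy as np

row = np.random.randint(0, 1000, 5000)

unique_values = np.unique(row)   # sorted NumPy array of distinct values
as_set = set(row.tolist())       # Python set built from the list of values

same_elements = set(unique_values.tolist()) == as_set
```

If an actual set is needed afterwards, set(unique_values.tolist()) converts the result, at the cost of boxing each distinct value once.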
In the comments you asked about numba and yes, it's true that numba could speed this up. Numba supports typed sets (only numeric types), but that doesn't mean it will be always faster.
I'm not sure how numba (re-)implements sets but because they are typed it's likely they also avoid the "Python boxes" and store the values directly inside the set:
Sets are more complicated than lists because they involve hashes and empty slots (Python uses open-addressing for sets, so I assume numba will too).
Like the NumPy array the numba set saves the values directly. So when you convert a NumPy array to a numba set (or vice versa) it won't need to use "Python boxes" at all, so when you create the sets in a numba nopython function it will be much faster even than the set(arr.tolist()) operation:
import numba as nb
@nb.njit
def tosets_numba(arr):
    for lineno in range(arr.shape[0]):
        set(arr[lineno])
tosets_numba(arr) # warmup
%timeit tosets_numba(arr)
# 6.55 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That's roughly five times faster than the set(arr.tolist()) approach. But it's important to highlight that I did not return the sets from the function. When you return a set from a nopython numba function to Python, Numba creates a Python set, including "creating the boxes" for all values in the set (that's something numba is hiding).
Just FYI: The same boxing/unboxing happens if you pass lists to Numba nopython functions or return lists from these functions. So what's an O(1) operation in Python is an O(n) operation with Numba! That's why it's generally better to pass NumPy arrays to numba nopython functions (which is O(1)).
I assume that if you return these sets from the function (not really possible right now because numba doesn't support lists of sets currently) it would be slower (because it creates a numba set and later converts it to a python set) or only marginally faster (if the conversion numbaset -> pythonset is really, really fast).
Personally I would use numba for sets only if I don't need to return them from the function and do all operations on the set inside the function and only if all the operations on the set are supported in nopython mode. In any other case I wouldn't use numba here.
Just a note: from numpy import * should be avoided, you hide several python built-in functions when you do that (sum, min, max, ...) and it puts a lot of stuff into your globals. Better to use import numpy as np. The np. in front of function calls makes the code clearer and isn't much to type.
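A small sketch of why the shadowing matters: the built-in sum and np.sum interpret their second positional argument differently, so the same call can silently change meaning after a star import. The example calls both functions explicitly instead of relying on the star import.

```python
import builtins
import numpy as np

values = [0, 1, 2, 3, 4]

# Built-in sum: the second argument is the start value, so this is 10 + (-1).
builtin_result = builtins.sum(values, -1)

# np.sum: the second argument is the axis, so this sums along the last axis.
numpy_result = np.sum(values, -1)
```

The built-in call gives 9 while the NumPy call gives 10.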
Here is a way to speed things up: avoid the loop and use a multiprocessing pool.map trick:
from multiprocessing.dummy import Pool as ThreadPool
import multiprocessing
pool = ThreadPool(multiprocessing.cpu_count()) # get the number of CPU
y = pool.map(set,x) # apply the function to your iterable
pool.close()
pool.join()