I want to modify block elements of a 3D array without a for loop, because the loop is the bottleneck of my code.
To illustrate what I want, I drew a figure (it shows the two off-diagonal 2x2 blocks of each 4x4 slice being swapped).
The code with the for loop:
import numpy as np

# Create a 3D array with 2x4x4 elements
a = np.arange(2*4*4).reshape(2, 4, 4)
b = np.zeros(np.shape(a))

# Change block elements
for it1 in range(2):
    b[it1] = np.block([[a[it1, 0:2, 0:2], a[it1, 2:4, 0:2]],
                       [a[it1, 0:2, 2:4], a[it1, 2:4, 2:4]]])
First let's see if there's a way to do what you want for a 2D array using only indexing, reshape, and transpose operations. If there is, then there's a good chance that you can extend it to a larger number of dimensions.
x = np.arange(2 * 3 * 2 * 5).reshape(2 * 3, 2 * 5)
Clearly you can reshape this into an array that has the blocks along a separate dimension:
x.reshape(2, 3, 2, 5)
Then you can transpose the resulting blocks:
x.reshape(2, 3, 2, 5).transpose(2, 1, 0, 3)
So far, none of the data has been copied. To make the copy happen, reshape back into the original shape:
x.reshape(2, 3, 2, 5).transpose(2, 1, 0, 3).reshape(2 * 3, 2 * 5)
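As a quick sanity check (a small 4x4 example added here, not part of the original answer), the off-diagonal blocks really do swap:

x = np.arange(16).reshape(4, 4)
y = x.reshape(2, 2, 2, 2).transpose(2, 1, 0, 3).reshape(4, 4)
# x = [[ 0  1  2  3]      y = [[ 0  1  8  9]
#      [ 4  5  6  7]           [ 4  5 12 13]
#      [ 8  9 10 11]           [ 2  3 10 11]
#      [12 13 14 15]]          [ 6  7 14 15]]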
Adding an extra leading dimension is then just a matter of shifting the indices of the dimensions you swap in the transpose:
b = a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)
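To convince yourself that the one-liner matches the looped np.block version, a quick equivalence check (my addition):

a = np.arange(2*4*4).reshape(2, 4, 4)
b_loop = np.zeros(a.shape)
for it1 in range(2):
    b_loop[it1] = np.block([[a[it1, 0:2, 0:2], a[it1, 2:4, 0:2]],
                            [a[it1, 0:2, 2:4], a[it1, 2:4, 2:4]]])
b_fast = a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)
assert np.array_equal(b_loop, b_fast)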
Here is a quick benchmark of the other implementations with your original array:
a = np.arange(2*4*4).reshape(2,4,4)
%%timeit
b = np.zeros(np.shape(a))
for it1 in range(2):
    b[it1] = np.block([[a[it1, 0:2, 0:2], a[it1, 2:4, 0:2]], [a[it1, 0:2, 2:4], a[it1, 2:4, 2:4]]])
27.7 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
b = a.copy()
b[:,0:2,2:4], b[:,2:4,0:2] = b[:,2:4,0:2].copy(), b[:,0:2,2:4].copy()
2.22 µs ± 3.89 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit b = np.block([[a[:,0:2,0:2], a[:,2:4,0:2]],[a[:,0:2,2:4], a[:,2:4,2:4]]])
13.6 µs ± 217 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit b = a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)
1.27 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
For small arrays, the differences can sometimes be attributed to overhead. Here is a more meaningful comparison with arrays of shape 10x1000x1000, where each 1000x1000 slice is split into four 500x500 blocks:
a = np.arange(10*1000*1000).reshape(10, 1000, 1000)
%%timeit
b = np.zeros(np.shape(a))
for it1 in range(10):
    b[it1] = np.block([[a[it1, 0:500, 0:500], a[it1, 500:1000, 0:500]], [a[it1, 0:500, 500:1000], a[it1, 500:1000, 500:1000]]])
58 ms ± 904 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
b = a.copy()
b[:,0:500,500:1000], b[:,500:1000,0:500] = b[:,500:1000,0:500].copy(), b[:,0:500,500:1000].copy()
41.2 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit b = np.block([[a[:,0:500,0:500], a[:,500:1000,0:500]],[a[:,0:500,500:1000], a[:,500:1000,500:1000]]])
27.5 ms ± 569 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit b = a.reshape(a.shape[0], 2, a.shape[1] // 2, 2, a.shape[2] // 2).transpose(0, 3, 2, 1, 4).reshape(a.shape)
20 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So it seems that using NumPy's own reshaping and transposition machinery is fastest on my computer. Also, notice that the overhead of np.block becomes less important than copying the temporary arrays as the size grows, so the other two implementations swap places.
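If you use this in more than one place, it may be worth wrapping the one-liner in a small helper. A minimal sketch (my naming; it assumes a stack of even-sided square matrices):

def swap_offdiag_blocks(a):
    # Transpose the 2x2 block grid of each slice; for a 2x2 grid
    # this swaps the off-diagonal blocks without a Python loop.
    n0, n1, n2 = a.shape
    return (a.reshape(n0, 2, n1 // 2, 2, n2 // 2)
             .transpose(0, 3, 2, 1, 4)
             .reshape(a.shape))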
You can directly replace it1 with a slice over the whole first dimension:
b = np.block([[a[:,0:2,0:2], a[:,2:4,0:2]],[a[:,0:2,2:4], a[:,2:4,2:4]]])
Will it make it faster?
import numpy as np
a = np.arange(2*4*4).reshape(2,4,4)
b = a.copy()
b[:,0:2,2:4], b[:,2:4,0:2] = b[:,2:4,0:2].copy(), b[:,0:2,2:4].copy()
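The .copy() calls on the right-hand side matter: tuple assignment evaluates the right-hand side first, but without copies those would be views into b, so the first assignment would overwrite data the second one still reads. A tiny 1D illustration (my addition):

b = np.arange(4)
b[0:2], b[2:4] = b[2:4], b[0:2]   # views, no copies
print(b)  # [2 3 2 3] -- not the intended swap [2 3 0 1]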
Comparison with the np.block() alternative from another answer:
Option 1:
%timeit b = a.copy(); b[:,0:2,2:4], b[:,2:4,0:2] = b[:,2:4,0:2].copy(), b[:,0:2,2:4].copy()
Output:
5.44 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Option 2:
%timeit b = np.block([[a[:,0:2,0:2], a[:,2:4,0:2]],[a[:,0:2,2:4], a[:,2:4,2:4]]])
Output:
30.6 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
I have a 1D array of integers with D elements (i.e. idx = np.array([i0, i1, ...]), s.t. idx.size = D), where each element corresponds to the index along that dimension of an ND array with D dimensions (i.e. data s.t. data.ndim = D). How can I index the data array using the index array idx?
In plain Python I would do data[tuple(idx)], but tuples aren't supported in numba nopython mode.
My current workaround is to use data.ravel() and convert the ND indices to 1D indices into the flattened array, but it seems like there must be an easier (and computationally faster) solution. Is there a take_along_each_axis(data, idx) method somewhere?
Let's do a bit of time testing:
In [135]: data = np.ones((100,100,100,100)); idx = (50,50,50,50)
That's nearly a GB of memory - not large enough to cause a memory error, but still a reasonable test. Actually, I get the same time for basic indexing with much smaller arrays, and with other idx values.
In [136]: timeit data[idx]
212 ns ± 9.25 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The interpreter translates that into a method call:
In [137]: timeit data.__getitem__(idx)
283 ns ± 4.37 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Indexing the 'flat' array can be done with:
In [138]: timeit data.flat[np.ravel_multi_index(idx,data.shape)]
6.65 µs ± 75.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
or taking the conversion out of the loop:
In [139]: %%timeit x=np.ravel_multi_index(idx,data.shape)
...: data.flat[x]
574 ns ± 23.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [142]: %%timeit x=np.ravel_multi_index(idx,data.shape);df=data.flat
...: df[x]
345 ns ± 6.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
I think there are cases where flat indexing is faster, but this isn't one.
So for a standalone operation, I don't see the point of writing an njit version. I suppose if it's part of some larger operation it could be worth it.
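For completeness, here is a minimal sketch of the flattening workaround the question describes, compiled with numba (my code; it assumes C-contiguous data and C-order indexing):

import numpy as np
from numba import njit

@njit
def getitem_nd(data, idx):
    # Build the flat (C-order) index by hand, since tuple(idx)
    # isn't available in nopython mode.
    flat = 0
    for d, s in enumerate(data.shape):
        flat = flat * s + idx[d]
    return data.ravel()[flat]

data = np.ones((100, 100, 100, 100))
idx = np.array([50, 50, 50, 50])
print(getitem_nd(data, idx))  # 1.0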
I have a function which processes an input array of dimension (h,w,200) (the number 200 can vary) and returns an array of dimension (h,w,50,3). The function takes ~0.8 seconds for an input array of size 512,512,200.
def myfunc(arr, n=50):
    # shape of arr is (h, w, 200); output shape is (h, w, 50, 3)
    # a1 is an array of length 50 that I get from a different function,
    # which doesn't take much time. For simplicity, I fix it as np.arange(0, 50).
    a1 = np.arange(0, 50)
    output = np.stack((arr[:, :, a1],) * 3, axis=-1)
    return output
This preprocessing step is done to ~8 arrays in a single batch, so loading a batch of data takes 8*0.8 = 6.4 seconds. Is there a way to speed up the computation of myfunc? Can I use libraries like numba for this?
I get about the same time:
In [14]: arr = np.ones((512,512,200))
In [15]: timeit output = np.stack((arr[:,:,np.arange(50)],)*3, axis=-1)
681 ms ± 5.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [16]: np.stack((arr[:,:,np.arange(50)],)*3, axis=-1).shape
Out[16]: (512, 512, 50, 3)
Looking at the timings in more detail.
First the index/copy step, takes about 1/3 of the time:
In [17]: timeit arr[:,:,np.arange(50)]
249 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
And the stack:
In [18]: %%timeit temp = arr[:,:,np.arange(50)]
...: output = np.stack([temp,temp,temp], axis=-1)
426 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
stack expands dimensions and then concatenates, so let's call concatenate directly:
In [19]: %%timeit temp = arr[:,:,np.arange(50),None]
...: output = np.concatenate([temp,temp,temp], axis=-1)
430 ms ± 8.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Another approach is to use repeat:
In [20]: %%timeit temp = arr[:,:,np.arange(50),None]
...: output = np.repeat(temp, 3, axis=-1)
531 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
So it looks like your code is about as good as it gets.
Indexing and concatenate already use compiled code, so I don't expect numba to help much (not that I have much experience with it).
Stacking on a new front axis is faster (making a (3, 512, 512, 50) array):
In [21]: %%timeit temp = arr[:,:,np.arange(50)]
...: output = np.stack([temp,temp,temp])
254 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That could then be transposed (cheaply), though subsequent operations might be slower (if they require a copy and/or reordering). A plain copy of the full output array times at around 350 ms.
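For instance (my illustration of that cheap transpose, with temp from the snippet above; transpose returns a view, so no data moves):

output = np.stack([temp, temp, temp]).transpose(1, 2, 3, 0)
output.shape  # (512, 512, 50, 3)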
Inspired by comments, I tried broadcasted assignment:
In [101]: %%timeit temp = arr[:,:,np.arange(50)]
...: res = np.empty(temp.shape + (3,), temp.dtype)
...: res[...] = temp[...,None]
337 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Same ball park.
Another trick is to play with strides to make a 'virtual' copy:
In [74]: res1 = np.broadcast_to(arr, (3,)+arr.shape)
In [75]: res1.shape
Out[75]: (3, 512, 512, 200)
In [76]: res1.strides
Out[76]: (0, 819200, 1600, 8)
This does not work for (512,512,200,3) because broadcasting aligns shapes from the trailing axes: the existing length-200 axis would have to line up with the new length-3 one. To put the repeated axis at the end you can use as_strided instead; a sketch follows at the end of this answer.
Though I can transpose this just fine:
np.broadcast_to(arr, (3,)+arr.shape).transpose(1,2,3,0)
In any case this is much faster:
In [82]: timeit res1 = np.broadcast_to(arr, (3,)+arr.shape)
10.4 µs ± 188 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
(but making a copy brings time back up.)
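Following up on the as_strided suggestion, a sketch (my addition) that puts the virtual length-3 axis at the end by giving it a zero stride, so nothing is copied:

from numpy.lib.stride_tricks import as_strided

res2 = as_strided(arr, shape=arr.shape + (3,), strides=arr.strides + (0,))
res2.shape  # (512, 512, 200, 3)
# Caution: all three entries along the last axis share the same memory,
# so treat the result as read-only.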
There is a list of tuples l = [(x,y,z), (x,y,z), (x,y,z)].
The idea is to find the fastest way to create separate np.arrays for the x's, the y's, and the z's. To compare speeds I use the code below:
import time

def myfast():
    code  # the snippet being measured

n = 1000000
t0 = time.time()
for i in range(n):
    myfast()
t1 = time.time()
total_n = t1 - t0
1. np.array([i[0] for i in l])
   np.array([i[1] for i in l])
   np.array([i[2] for i in l])

   output: 0.9980638027191162

2. array_x = np.zeros((len(l), 1), dtype="float")
   array_y = np.zeros((len(l), 1), dtype="float")
   array_z = np.zeros((len(l), 1), dtype="float")
   for i, zxc in enumerate(l):
       array_x[i] = zxc[0]
       array_y[i] = zxc[1]
       array_z[i] = zxc[2]

   output: 5.5509934425354

3. [np.array(x) for x in zip(*l)]

   output: 2.5070037841796875

4. array_x, array_y, array_z = np.array(list(zip(*l)))

   output: 2.725318431854248
There are some really good options in here, so I summarized them and compared their speed:
import numpy as np

def f1(input_data):
    array_x = np.array([elem[0] for elem in input_data])
    array_y = np.array([elem[1] for elem in input_data])
    array_z = np.array([elem[2] for elem in input_data])
    return array_x, array_y, array_z

def f2(input_data):
    array_x = np.zeros((len(input_data), ), dtype="float")
    array_y = np.zeros((len(input_data), ), dtype="float")
    array_z = np.zeros((len(input_data), ), dtype="float")
    for i, elem in enumerate(input_data):
        array_x[i] = elem[0]
        array_y[i] = elem[1]
        array_z[i] = elem[2]
    return array_x, array_y, array_z

def f3(input_data):
    return [np.array(elem) for elem in zip(*input_data)]

def f4(input_data):
    return np.array(list(zip(*input_data)))

def f5(input_data):
    return np.array(input_data).transpose()

def f6(input_data):
    array_all = np.array(input_data)
    array_x = array_all[:, 0]
    array_y = array_all[:, 1]
    array_z = array_all[:, 2]
    return array_x, array_y, array_z
First I asserted that all of them return the same data (using np.array_equal()):
data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
for array_list in zip(f1(data), f2(data), f3(data), f4(data), f5(data), f6(data)):
    # print()
    # for i, arr in enumerate(array_list):
    #     print('array from function', i+1)
    #     print(arr)
    for i, arr in enumerate(array_list[:-1]):
        assert np.array_equal(arr, array_list[i+1])
And the time comparison:
import timeit

for f in [f1, f2, f3, f4, f5, f6]:
    t = timeit.timeit('f(data)', 'from __main__ import data, f', number=100000)
    print('{:5s} {:10.4f} seconds'.format(f.__name__, t))
gives these results:
data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)] # 3 tuples
timeit number=100000
f1 0.3184 seconds
f2 0.4013 seconds
f3 0.2826 seconds
f4 0.2091 seconds
f5 0.1732 seconds
f6 0.2159 seconds
data = [(1, 2, 3) for _ in range(10**6)] # 1 millon tuples
timeit number=10
f1 2.2168 seconds
f2 2.8657 seconds
f3 2.0150 seconds
f4 1.9790 seconds
f5 2.6380 seconds
f6 2.6586 seconds
making f5() the fastest option for short input and f4() the fastest option for big input.
If the number of elements in each tuple is more than 3, then only three of the functions apply (the others are hardcoded for 3-element tuples):
data = [tuple(range(10**4)) for _ in range(10**3)]
timeit number=10
f3 11.8396 seconds
f4 13.4672 seconds
f5 4.6251 seconds
making f5() again the fastest option for these criteria.
You could try:
import numpy
array_x, array_y, array_z = numpy.array(list(zip(*l)))
or just:
numpy.array(list(zip(*l)))
and a more elegant way:
numpy.array(l).transpose()
Maybe I am missing something, but why not just pass the list of tuples directly to np.array? Say:
n = 100
l = [(0, 1, 2) for _ in range(n)]
arr = np.array(l)
x = arr[:, 0]
y = arr[:, 1]
z = arr[:, 2]
Btw, I prefer to use the following to time code:
from timeit import default_timer as timer
t0 = timer()
do_heavy_calculation()
print("Time taken [sec]:", timer() - t0)
I believe most (but not all) of the ingredients of this answer are present in the other answers, but so far I have not seen an apples-to-apples comparison: some approaches do not return a list of np.ndarray objects, but rather a single np.ndarray() (which I find convenient).
It is not clear whether this is acceptable to you, so I am adding proper code for this.
Besides that, the performances may differ because in some cases you are adding an extra step, while in others you may not need to create large objects (which could reside in different memory pages).
In the end, for smaller inputs (3 x 10), the list of np.ndarray()s is just some additional burden that adds significantly to the timing.
For larger inputs (3 x 1000) and above, the extra computation is no longer significant, and an approach that uses comprehensions and avoids creating a large NumPy array can become as fast as (or even faster than) the fastest methods were for smaller inputs.
Also, all the code I present works for arbitrary sizes of the tuples/list (as long as the inner tuples all have the same size, of course).
(EDIT: added a comment on the final results)
The tested methods are:
import numpy as np

def to_arrays_zip(items):
    return np.array(list(zip(*items)))

def to_arrays_transpose(items):
    return np.array(items).transpose()

def to_arrays_zip_split(items):
    return [arr for arr in np.array(list(zip(*items)))]

def to_arrays_transpose_split(items):
    return [arr for arr in np.array(items).transpose()]

def to_arrays_comprehension(items):
    return [np.array([items[i][j] for i in range(len(items))]) for j in range(len(items[0]))]

def to_arrays_comprehension2(items):
    return [np.array([item[j] for item in items]) for j in range(len(items[0]))]
(This is a convenient function to check that the results are the same.)
def test_equal(items1, items2):
    return all(np.all(x == y) for x, y in zip(items1, items2))
For small inputs:
N = 3
M = 10
ll = [tuple(range(N)) for _ in range(M)]
print(to_arrays_comprehension2(ll))
print('Returning `np.ndarray()`')
%timeit to_arrays_zip(ll)
# 2.82 µs ± 28 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_transpose(ll)
# 3.18 µs ± 30 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
print('Returning a list')
%timeit to_arrays_zip_split(ll)
# 3.71 µs ± 47 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_transpose_split(ll)
# 3.97 µs ± 42.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_comprehension(ll)
# 5.91 µs ± 96.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_comprehension2(ll)
# 5.14 µs ± 109 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Where the podium is:
to_arrays_zip_split() (or the non-_split variant if you are OK with a single array)
to_arrays_transpose_split() (or the non-_split variant if you are OK with a single array)
to_arrays_comprehension2()
For somewhat larger inputs:
N = 3
M = 1000
ll = [tuple(range(N)) for _ in range(M)]
print('Returning `np.ndarray()`')
%timeit to_arrays_zip(ll)
# 146 µs ± 2.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit to_arrays_transpose(ll)
# 222 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('Returning a list')
%timeit to_arrays_zip_split(ll)
# 147 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit to_arrays_transpose_split(ll)
# 221 µs ± 2.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit to_arrays_comprehension(ll)
# 261 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit to_arrays_comprehension2(ll)
# 212 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The podium becomes:
to_arrays_zip_split() (whether you use the _split or non-_split variant does not make much difference)
to_arrays_comprehension2()
to_arrays_transpose_split() (whether you use the _split or non-_split variant does not make much difference)
For even larger inputs:
N = 3
M = 1000000
ll = [tuple(range(N)) for _ in range(M)]
print('Returning `np.ndarray()`')
%timeit to_arrays_zip(ll)
# 215 ms ± 4.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_transpose(ll)
# 220 ms ± 4.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('Returning a list')
%timeit to_arrays_zip_split(ll)
# 218 ms ± 6.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_transpose_split(ll)
# 222 ms ± 3.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_comprehension(ll)
# 248 ms ± 3.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_comprehension2(ll)
# 186 ms ± 481 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The podium becomes:
to_arrays_comprehension2()
to_arrays_zip_split() (whether you use the _split or non-_split variant does not make much difference)
to_arrays_transpose_split() (whether you use the _split or non-_split variant does not make much difference)
and the _zip and _transpose variants are pretty close to each other.
(I also tried to speed things up with Numba, but that didn't go well.)
I am wondering if there is any downside to using b = np.array(a) rather than b = np.copy(a) to copy a NumPy array a into b. When I %timeit, the former can be up to 100% faster.
In both cases b is a evaluates to False, and I can manipulate b leaving a intact, so I suppose this does what is expected of .copy().
Am I missing anything? What is improper about using np.array to copy an array?
With Python 3.6.5 and NumPy 1.14.2, the speed difference closes rapidly for larger sizes:
a = np.arange(1000)
%timeit np.array(a)
501 ns ± 30.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.copy(a)
1.1 µs ± 35.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
From the documentation of numpy.copy:
This is equivalent to:
>>> np.array(a, copy=True)
Also, if you look at the source code:
def copy(a, order='K'):
    return array(a, order=order, copy=True)
Some timings:
In [1]: import numpy as np
In [2]: a = np.ascontiguousarray(np.random.randint(0, 20000, 1000))
In [3]: %timeit b = np.array(a)
562 ns ± 10.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit b = np.array(a, order='K', copy=True)
1.1 µs ± 10.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit b = np.copy(a)
1.21 µs ± 9.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [6]: a = np.ascontiguousarray(np.random.randint(0, 20000, 1000000))
In [7]: %timeit b = np.array(a)
310 µs ± 6.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit b = np.array(a, order='K', copy=True)
311 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit b = np.copy(a)
313 µs ± 4.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [10]: print(np.__version__)
1.13.3
It is unexpected that simply setting parameters explicitly to their default values changes the speed of execution of np.array(). On the other hand, maybe just processing these explicit arguments adds enough execution time to make a difference for small arrays. Indeed, from the source code for numpy.array(), one can see that there are many more checks and more processing performed when keyword arguments are provided (for example, see goto full_path); when keyword parameters are not set, the execution skips all the way down to goto finish. This overhead (the additional processing of keyword arguments) is what you detect in the timings for small arrays. For larger arrays it is insignificant in comparison to the actual time of copying the arrays.
"What is improper about using np.array to do copy an array?"
I'd argue it is harder to read, because it is not obvious that array makes a copy; the similar asarray, for example, does not make a copy if it doesn't have to. The reader basically has to know the default value of the copy keyword argument to be sure.
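A small illustration of that pitfall (my example):

import numpy as np

a = np.arange(3)
print(np.asarray(a) is a)  # True -- asarray made no copy
print(np.array(a) is a)    # False -- array copies by default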
As AGN pointed out, np.array is faster than np.copy because essentially the latter is a wrapper around the former, which means Python "loses" some extra time looking up and calling both functions. A similar thing happens with decorators.
This extra time is insignificant for practical purposes, and you gain better code readability.
You can test this with a big array (where the array creation takes most of the time); you'll see very little difference in %timeit between the two.
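For example (my sketch of that test):

import numpy as np

a = np.random.rand(10_000_000)  # large enough that copying dominates
%timeit b = np.array(a)
%timeit b = np.copy(a)
# On an array this size the two should time within a few percent of each other.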