Basically I want to map over each value of a multidimensional numpy array. The output should have the same shape as the input.
This is the way I did it:
import numpy as np

def f(x):
    return x*x
input = np.arange(3*4*5).reshape(3,4,5)
output = np.array(list(map(f, input)))
print(output)
It works, but it feels a bit too complicated (np.array, list, map). Is there a more elegant solution?
Just call your function on the array:
f(input)
Also, better not to use the name input for your variable as it is a builtin:
import numpy as np
def f(x):
    return x*x

arr = np.arange(3*4*5).reshape(3, 4, 5)
print(np.all(f(arr) == np.array(list(map(f, arr)))))
Output:
True
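This works for any function built from arithmetic operators and NumPy ufuncs, which all broadcast elementwise. For instance (a sketch, with a made-up function g):

import numpy as np

def g(x):
    # composed entirely of operators and ufuncs, so it accepts scalars and arrays alike
    return np.sin(x) + x**2

arr = np.arange(3*4*5).reshape(3, 4, 5)
print(g(arr).shape)  # (3, 4, 5) -- same shape as the input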
If the function is more complex:
def f(x):
    return x + 1 if x % 2 else 2*x
use vectorize:
np.vectorize(f)(arr)
Better, always try to use the vectorized NumPy functions such as np.where:
>>> np.all(np.vectorize(f)(arr) == np.where(arr % 2, arr + 1, arr * 2))
True
The native NumPy version is considerably faster:
%%timeit
np.vectorize(f)(arr)
34 µs ± 996 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
np.where(arr % 2, arr + 1, arr * 2)
5.16 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
This is much more pronounced for larger arrays:
big_arr = np.arange(30 * 40 * 50).reshape(30, 40, 50)
%%timeit
np.vectorize(f)(big_arr)
15.5 ms ± 318 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
np.where(big_arr % 2, big_arr + 1, big_arr * 2)
797 µs ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I'm using numpy (ideally Numba) to perform a tensor contraction that involves three tensors, one of which is a vector that should multiply only one index of the others. For example,
A = np.random.normal(size=(20,20,20,20))
B = np.random.normal(size=(20,20,20,20))
v = np.sqrt(np.arange(20))
# e.g. v on the 3rd index
>>> %timeit np.vdot(A * v[None, None, :, None], B)
125 µs ± 5.14 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
compare with
C = np.random.normal(size=(20,20,20,20))
>>> %timeit np.vdot(A * C, B)
76.8 µs ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Is there a more efficient way of including the product with v? It feels wrong that it should be slower than multiplying by the full tensor C.
I could squeeze out some performance by using Numba with parallel=True:
import numba as nb
import numpy as np
N = 50
@nb.njit('float64(float64[:,:,:,:], float64[:,:,:,:], float64[:])', parallel=True)
def dotABv(a, b, vv):
    res = 0.0
    for i in nb.prange(a.shape[0]):
        for j in range(a.shape[1]):
            for k in range(a.shape[2]):
                res += vv[k] * np.dot(a[i, j, k, :], b[i, j, k, :])
    return res
v = np.sqrt(np.arange(N))
A = np.random.normal(size=(N,N,N,N))
B = np.random.normal(size=(N,N,N,N))
C = np.random.normal(size=(N,N,N,N))
%timeit dotABv(A,B,v)
%timeit np.vdot(A , B) ## just to compare with dot
%timeit np.vdot(A * v[None, None, :, None], B)
# Output :
# 501 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# 1.1 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# 14.7 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
print(dotABv(A,B,v), np.vdot(A * v[None, None, :, None], B))
# 5105.504508154087 5105.5045081541075
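You could also write the whole contraction as a single np.einsum call, which avoids materializing the intermediate A * v[None, None, :, None] array. A sketch; whether it beats the Numba version depends on the sizes and the optimize path, so time it on your own data:

import numpy as np

N = 50
v = np.sqrt(np.arange(N))
A = np.random.normal(size=(N, N, N, N))
B = np.random.normal(size=(N, N, N, N))

# sum over i,j,k,l of A[i,j,k,l] * v[k] * B[i,j,k,l] in one contraction
res = np.einsum('ijkl,k,ijkl->', A, v, B, optimize=True)

# agrees with the explicit broadcast-then-vdot version
print(np.allclose(res, np.vdot(A * v[None, None, :, None], B)))  # True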
Suppose I have two matrices:
A: size k x m
B: size m x n
Using a custom operation, my output will be k x n.
This custom operation is not a dot product between the rows of A and columns of B. Suppose this custom operation is defined as:
For the Ith row of A and the Jth column of B, the (I, J) element of the output is:
sum((A[I, m] + B[m, J])**20 for m in range(M))
where m runs over the shared dimension of length M. The only way I can see to implement this is to expand this equation, calculate each term, then sum them.
Is there a way in numpy or pytorch to do this without expanding the equation?
Apart from the method @hpaulj outlines in the comments, you can also use the fact that what you are calculating is essentially a pairwise Minkowski distance: because the exponent 20 is even, (a + b)**20 == abs(a - (-b))**20, so the sum is the 20th power of the p=20 Minkowski distance between the rows of A and the rows of -B.T:
import numpy as np
from scipy.spatial.distance import cdist
k,m,n = 10,20,30
A = np.random.random((k,m))
B = np.random.random((m,n))
method1 = ((A[...,None]+B)**20).sum(axis=1)
method2 = cdist(A,-B.T,'m',p=20)**20
np.allclose(method1,method2)
# True
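The even exponent is what makes the identity work; a minimal sanity check:

import numpy as np

a, b = 0.3, -1.7
# even powers discard the sign, so (a + b)**20 == |a - (-b)|**20
print(np.isclose((a + b)**20, np.abs(a - (-b))**20))  # True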
You can also implement it yourself.
The following template generates all kinds of dot-product-like functions, but don't use it to replace np.dot itself, because it will be quite a lot slower for larger arrays.
Template
import numpy as np
import numba as nb
from scipy.spatial.distance import cdist
def gen_dot_like_func(kernel, parallel=True):
    kernel_nb = nb.njit(kernel, fastmath=True)
    def cust_dot(A, B_in):
        B = np.ascontiguousarray(B_in.T)
        assert B.shape[1] == A.shape[1]
        out = np.empty((A.shape[0], B.shape[0]), dtype=A.dtype)
        for i in nb.prange(A.shape[0]):
            for j in range(B.shape[0]):
                acc = 0.  # float accumulator (avoid shadowing the sum builtin)
                for k in range(A.shape[1]):
                    acc += kernel_nb(A[i, k], B[j, k])
                out[i, j] = acc
        return out
    return nb.njit(cust_dot, fastmath=True, parallel=parallel)
Generate your function
# This can be useful if you have a lot of matrix-multiplication-like functions
my_func = gen_dot_like_func(lambda A, B: (A + B)**20, parallel=True)
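A quick sanity check against the broadcasting version (a sketch, reusing the imports and my_func from above):

k, m, n = 10, 20, 30
A = np.random.random((k, m))
B = np.random.random((m, n))

# the generated kernel should agree with the naive broadcasting approach
print(np.allclose(my_func(A, B), ((A[..., None] + B)**20).sum(axis=1)))  # True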
Timings
k,m,n = 10,20,30
%timeit method1 = ((A[...,None]+B)**20).sum(axis=1)
192 µs ± 554 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit method2 = cdist(A,-B.T,'m',p=20)**20
208 µs ± 1.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit res=my_func(A,B) #parallel=False
4.01 µs ± 34.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
k,m,n = 500,100,500  # A and B presumably re-generated with these shapes
%timeit method1 = ((A[...,None]+B)**20).sum(axis=1)
852 ms ± 4.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit method2 = cdist(A,-B.T,'m',p=20)**20
714 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res=my_func(A,B) #parallel=True
1.81 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Let's say that I want to perform a mathematical summation, say the Madhava–Leibniz formula for π, in Python:
Within a function called Leibniz_pi(), I could create a loop to calculate the nth partial sum, such as:
def Leibniz_pi(n):
    nth_partial_sum = 0  # initialize the variable
    for i in range(n+1):
        nth_partial_sum += ((-1)**i)/(2*i + 1)
    return nth_partial_sum
I'm assuming it would be faster to use something like xrange() instead of range(). Would it be even faster to use numpy and its built in numpy.sum() method? What would such an example look like?
I guess most people will consider the numpy-only solution by @zero the most pythonic, but it is certainly not the fastest. With some additional optimizations you can beat the already fast numpy implementation by a factor of about 50.
Using only Numpy (@zero)
import numpy as np
import numexpr as ne
import numba as nb
def Leibniz_point(n):
    val = (-1)**n / (2*n + 1)
    return val
%timeit Leibniz_point(np.arange(1000)).sum()
33.8 µs ± 203 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Make use of numexpr
n=np.arange(1000)
%timeit ne.evaluate("sum((-1)**n / (2*n + 1))")
21 µs ± 354 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Compile your function using Numba
# with error_model="numpy", division-by-zero checks are turned off
@nb.njit(error_model="numpy", cache=True)
def Leibniz_pi(n):
    nth_partial_sum = 0.  # initialize the variable as float64
    for i in range(n+1):
        nth_partial_sum += ((-1)**i)/(2*i + 1)
    return nth_partial_sum
%timeit Leibniz_pi(999)
6.48 µs ± 38.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Edit, optimizing away the costly (-1)**n
import numba as nb
import numpy as np
# replacement for the much more costly (-1)**n
@nb.njit()
def sgn(i):
    if i % 2 > 0:
        return -1.
    else:
        return 1.

# error_model="numpy" turns off the division-by-zero checks
#
# fastmath=True makes SIMD vectorization possible in this case.
# Floating point math is in general not associative:
# calculating four terms of sgn(i)/(2*i + 1) at once and then summing them
# is not exactly the same as accumulating sequentially, so you have to
# explicitly allow the compiler to make this optimization.
@nb.njit(fastmath=True, error_model="numpy", cache=True)
def Leibniz_pi(n):
    nth_partial_sum = 0.  # initialize the variable
    for i in range(n+1):
        nth_partial_sum += sgn(i)/(2*i + 1)
    return nth_partial_sum
%timeit Leibniz_pi(999)
777 ns ± 5.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
3 suggestions (with speed comparison):
First, define the Leibniz point, not the cumulative sum:
def Leibniz_point(n):
    val = (-1)**n / (2*n + 1)
    return val
1) sum a list comprehension
%timeit sum([Leibniz_point(n) for n in range(100)])
58.8 µs ± 825 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit sum([Leibniz_point(n) for n in range(1000)])
667 µs ± 3.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2) standard for loop
%%timeit
sum = 0
for n in range(100):
    sum += Leibniz_point(n)
61.8 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
sum = 0
for n in range(1000):
    sum += Leibniz_point(n)
729 µs ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3) use a numpy array (suggested)
%timeit Leibniz_point(np.arange(100)).sum()
11.5 µs ± 866 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit Leibniz_point(np.arange(1000)).sum()
61.8 µs ± 3.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In general, for operations involving collections of more than a few elements, numpy will be faster. A simple numpy implementation could be something like this:
def leibniz(n):
    a = np.arange(n + 1)
    return (((-1.0) ** a) / (2 * a + 1)).sum()
Note that you must specify that the numerator is a float with 1.0 on Python 2. On Python 3, 1 will be fine.
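As a quick usage check (a sketch): the series sums to π/4, so multiplying the partial sum by 4 should approach π:

import numpy as np

def leibniz(n):
    a = np.arange(n + 1)
    return (((-1.0) ** a) / (2 * a + 1)).sum()

print(4 * leibniz(10**6))  # ~3.141594, converging at roughly O(1/n)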
There is a list of tuples l = [(x,y,z), (x,y,z), (x,y,z)]
The idea is to find the fastest way to create separate np.arrays for the x-s, the y-s and the z-s. I need help finding the fastest solution to do it. To compare speeds I use the code attached below:
import time
def myfast():
    code  # placeholder for the method being timed

n = 1000000
t0 = time.time()
for i in range(n):
    myfast()
t1 = time.time()
total_n = t1 - t0
1. np.array([i[0] for i in l])
   np.array([i[1] for i in l])
   np.array([i[2] for i in l])

   output: 0.9980638027191162

2. array_x = np.zeros((len(l), 1), dtype="float")
   array_y = np.zeros((len(l), 1), dtype="float")
   array_z = np.zeros((len(l), 1), dtype="float")
   for i, zxc in enumerate(l):
       array_x[i] = zxc[0]
       array_y[i] = zxc[1]
       array_z[i] = zxc[2]

   output: 5.5509934425354

3. [np.array(x) for x in zip(*l)]

   output: 2.5070037841796875

4. array_x, array_y, array_z = np.array(list(zip(*l)))

   output: 2.725318431854248
There are some really good options in here, so I summarized them and compared speed:
import numpy as np
def f1(input_data):
    array_x = np.array([elem[0] for elem in input_data])
    array_y = np.array([elem[1] for elem in input_data])
    array_z = np.array([elem[2] for elem in input_data])
    return array_x, array_y, array_z

def f2(input_data):
    array_x = np.zeros((len(input_data), ), dtype="float")
    array_y = np.zeros((len(input_data), ), dtype="float")
    array_z = np.zeros((len(input_data), ), dtype="float")
    for i, elem in enumerate(input_data):
        array_x[i] = elem[0]
        array_y[i] = elem[1]
        array_z[i] = elem[2]
    return array_x, array_y, array_z

def f3(input_data):
    return [np.array(elem) for elem in zip(*input_data)]

def f4(input_data):
    return np.array(list(zip(*input_data)))

def f5(input_data):
    return np.array(input_data).transpose()

def f6(input_data):
    array_all = np.array(input_data)
    array_x = array_all[:, 0]
    array_y = array_all[:, 1]
    array_z = array_all[:, 2]
    return array_x, array_y, array_z
First I asserted that all of them return the same data (using np.array_equal()):
data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
for array_list in zip(f1(data), f2(data), f3(data), f4(data), f5(data), f6(data)):
    # print()
    # for i, arr in enumerate(array_list):
    #     print('array from function', i+1)
    #     print(arr)
    for i, arr in enumerate(array_list[:-1]):
        assert np.array_equal(arr, array_list[i+1])
And the time comparison:
import timeit
for f in [f1, f2, f3, f4, f5, f6]:
    t = timeit.timeit('f(data)', 'from __main__ import data, f', number=100000)
    print('{:5s} {:10.4f} seconds'.format(f.__name__, t))
gives these results:
data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)] # 3 tuples
timeit number=100000
f1 0.3184 seconds
f2 0.4013 seconds
f3 0.2826 seconds
f4 0.2091 seconds
f5 0.1732 seconds
f6 0.2159 seconds
data = [(1, 2, 3) for _ in range(10**6)] # 1 million tuples
timeit number=10
f1 2.2168 seconds
f2 2.8657 seconds
f3 2.0150 seconds
f4 1.9790 seconds
f5 2.6380 seconds
f6 2.6586 seconds
making f5() the fastest option for short input and f4() the fastest option for big input.
If the tuples have more than 3 elements, only three of the functions apply (the others are hardcoded for 3-element tuples):
data = [tuple(range(10**4)) for _ in range(10**3)]
timeit number=10
f3 11.8396 seconds
f4 13.4672 seconds
f5 4.6251 seconds
making f5() again the fastest option for these criteria.
You could try:
import numpy
array_x, array_y, array_z = numpy.array(list(zip(*l)))
or just:
numpy.array(list(zip(*l)))
and a more elegant way:
numpy.array(l).transpose()
Maybe I am missing something, but why not just pass the list of tuples directly to np.array? Say if:
n = 100
l = [(0, 1, 2) for _ in range(n)]
arr = np.array(l)
x = arr[:, 0]
y = arr[:, 1]
z = arr[:, 2]
Btw, I prefer to use the following to time code:
from timeit import default_timer as timer
t0 = timer()
do_heavy_calculation()
print("Time taken [sec]:", timer() - t0)
I believe most (but not all) of the ingredients of this answer are actually present in the other answers, but in none of them have I seen an apples-to-apples comparison, in the sense that some approaches were not returning a list of np.ndarray objects, but rather a (more convenient, in my opinion) single np.ndarray.
It is not clear whether this is acceptable to you, so I am adding proper code for both.
Besides that, the performance may differ because in some cases you are adding an extra step, while in others you may not need to create large objects (which could reside in different memory pages).
In the end, for smaller inputs (3 x 10), the list of np.ndarrays is just additional burden that adds up significantly in the timings.
For larger inputs (3 x 1000) and above, the extra computation is no longer significant, and an approach involving comprehensions that avoids the creation of a large numpy array can become as fast as (or even faster than) the fastest methods for smaller inputs.
Also, all the code I present works for arbitrary sizes of the tuples/lists (as long as the inner tuples all have the same size, of course).
(EDIT: added a comment on the final results)
The tested methods are:
import numpy as np
def to_arrays_zip(items):
    return np.array(list(zip(*items)))

def to_arrays_transpose(items):
    return np.array(items).transpose()

def to_arrays_zip_split(items):
    return [arr for arr in np.array(list(zip(*items)))]

def to_arrays_transpose_split(items):
    return [arr for arr in np.array(items).transpose()]

def to_arrays_comprehension(items):
    return [np.array([items[i][j] for i in range(len(items))]) for j in range(len(items[0]))]

def to_arrays_comprehension2(items):
    return [np.array([item[j] for item in items]) for j in range(len(items[0]))]
(This is a convenient function to check that the results are the same.)
def test_equal(items1, items2):
    return all(np.all(x == y) for x, y in zip(items1, items2))
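For example (a quick sketch), checking two of the methods against each other:

ll = [(1, 2, 3), (4, 5, 6)]
assert test_equal(to_arrays_zip_split(ll), to_arrays_comprehension2(ll))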
For small inputs:
N = 3
M = 10
ll = [tuple(range(N)) for _ in range(M)]
print(to_arrays_comprehension2(ll))
print('Returning `np.ndarray()`')
%timeit to_arrays_zip(ll)
# 2.82 µs ± 28 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_transpose(ll)
# 3.18 µs ± 30 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
print('Returning a list')
%timeit to_arrays_zip_split(ll)
# 3.71 µs ± 47 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_transpose_split(ll)
# 3.97 µs ± 42.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_comprehension(ll)
# 5.91 µs ± 96.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_comprehension2(ll)
# 5.14 µs ± 109 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Where the podium is:
to_arrays_zip_split() (the non-_split variant if you are OK with a single array)
to_arrays_transpose_split() (the non-_split variant if you are OK with a single array)
to_arrays_comprehension2()
For somewhat larger inputs:
N = 3
M = 1000
ll = [tuple(range(N)) for _ in range(M)]
print('Returning `np.ndarray()`')
%timeit to_arrays_zip(ll)
# 146 µs ± 2.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit to_arrays_transpose(ll)
# 222 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('Returning a list')
%timeit to_arrays_zip_split(ll)
# 147 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit to_arrays_transpose_split(ll)
# 221 µs ± 2.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit to_arrays_comprehension(ll)
# 261 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit to_arrays_comprehension2(ll)
# 212 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The podium becomes:
to_arrays_zip_split() (whether you use the _split or non-_split variant does not make much difference)
to_arrays_comprehension2()
to_arrays_transpose_split() (whether you use the _split or non-_split variant does not make much difference)
For even larger inputs:
N = 3
M = 1000000
ll = [tuple(range(N)) for _ in range(M)]
print('Returning `np.ndarray()`')
%timeit to_arrays_zip(ll)
# 215 ms ± 4.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_transpose(ll)
# 220 ms ± 4.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('Returning a list')
%timeit to_arrays_zip_split(ll)
# 218 ms ± 6.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_transpose_split(ll)
# 222 ms ± 3.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_comprehension(ll)
# 248 ms ± 3.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_comprehension2(ll)
# 186 ms ± 481 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The podium becomes:
to_arrays_comprehension2()
to_arrays_zip_split() (whether you use the _split or non-_split variant does not make much difference)
to_arrays_transpose_split() (whether you use the _split or non-_split variant does not make much difference)
and the _zip and _transpose variants are pretty close to each other.
(I also tried to speed things up with Numba, but that didn't go well.)
I've found that len(arr) is almost twice as fast as arr.shape[0] and am wondering why.
I am using Python 3.5.2, Numpy 1.14.2, IPython 6.3.1
The below code demonstrates this:
import numpy as np

arr = np.random.randint(1, 11, size=(3, 4, 5))
%timeit len(arr)
# 62.6 ns ± 0.239 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit arr.shape[0]
# 102 ns ± 0.163 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
I've also done some more tests for comparison:
class Foo():
    def __init__(self):
        self.shape = (3, 4, 5)
foo = Foo()
%timeit arr.shape
# 75.6 ns ± 0.107 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit foo.shape
# 61.2 ns ± 0.281 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit foo.shape[0]
# 78.6 ns ± 1.03 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
So I have two questions:
1) Why does len(arr) work faster than arr.shape[0]? (I would have thought len would be slower because of the function call.)
2) Why does foo.shape[0] work faster than arr.shape[0]? (In other words, what overhead do numpy arrays incur in this case?)
The numpy array data structure is implemented in C. The dimensions of the array are stored in a C structure. They are not stored in a Python tuple. So each time you read the shape attribute, a new Python tuple of new Python integer objects is created. When you use arr.shape[0], that tuple is then indexed to pull out the first element, which adds a little more overhead. len(arr) only has to create a Python integer.
You can easily verify that arr.shape creates a new tuple each time it is read:
In [126]: arr = np.random.randint(1, 11, size=(3, 4, 5))
In [127]: s1 = arr.shape
In [128]: id(s1)
Out[128]: 4916019848
In [129]: s2 = arr.shape
In [130]: id(s2)
Out[130]: 4909905024
s1 and s2 have different ids; they are different tuple objects.
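A practical consequence (a sketch, assuming a dimension is needed repeatedly in hot code): read arr.shape once and index the cached tuple, or use len(arr) for the first dimension:

import numpy as np

arr = np.random.randint(1, 11, size=(3, 4, 5))

# arr.shape builds a fresh tuple on every access, so cache it once
shape = arr.shape
rows = shape[0]  # indexes the cached tuple; no new tuple is created

# for the first dimension, len() is the cheaper equivalent
assert len(arr) == arr.shape[0]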