Create a random integer ndarray sampled from a different range per element - Python

I want to generate an ndarray, a, full of random integers which are sampled from different ranges according to another array, span. For example:
import numpy as np
span = [5,6,7,8,9]
def get_a(span, count):
    a = np.stack([np.random.choice(i, count) for i in span], axis=0)
    return a
get_a(span,2)
Is there a fast way to do get_a?

Yes. First, timing your version:
import timeit
import numpy as np
span = np.arange(1,100)
def get_a(span, count):
    a = np.stack([np.random.choice(i, count) for i in span], axis=0)
    return a
%timeit get_a(span,2)
2.32 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
My solution is hundreds of times faster for largish arrays:
def get_b(span, count):
    b = (np.random.rand(len(span), count)*span[:,None]).astype(int)
    return b
%timeit get_b(span,2)
6.91 µs ± 267 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
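As a side note (not from the original answer), newer NumPy versions (1.17+) expose a Generator API whose integers() method broadcasts array-valued bounds, so you can draw exact uniform integers per row without the float multiply-and-truncate; a minimal sketch:
import numpy as np

rng = np.random.default_rng()
span = np.arange(1, 100)

def get_c(span, count):
    # integers() broadcasts the per-row upper bounds, so row i is drawn
    # uniformly from [0, span[i]) without an intermediate float array
    return rng.integers(0, span[:, None], size=(len(span), count))

get_c(span, 2)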

Related

What's under the hood of numpy's 'mean' function such that it works faster than built-in Python methods?

I've been exploring the performance differences between numpy functions and Python's built-in functions, and I want to know how numpy functions are optimized such that there's almost a 100x speed-up.
Below is some code that I wrote to highlight the execution-time differences between numpy mean() and a manual calculation of the mean using sum() and len():
import numpy as np
import time
n = 10**7
a = np.random.randn(n)
start = time.perf_counter()
mean = sum(a)/len(a)
seconds1 = time.perf_counter()-start
start = time.perf_counter()
mean = np.mean(a)
seconds2 = time.perf_counter()-start
print("First method takes time {:.3f}s".format(seconds1))
print("Second method takes time {:.3f}s".format(seconds2))
Output:
First method takes time 1.687s
Second method takes time 0.013s
Make a numpy array:
In [130]: a=np.arange(10000)
Apply the numpy sum function:
In [131]: timeit np.sum(a)
16.2 µs ± 22.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
mean is a bit slower, since it has to divide by the number of elements (and may do a few other checks):
In [132]: timeit np.mean(a)
34.9 µs ± 198 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.sum actually delegates the action to the sum method of the array, so using that directly is a bit faster:
In [133]: timeit a.sum()
13.3 µs ± 25.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Python sum isn't a bad function, but it iterates over its argument. Iterating (in Python code) on an array is slow:
In [134]: timeit sum(a)
1.16 ms ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Converting the array to a list first saves time:
In [135]: timeit sum(a.tolist())
369 µs ± 7.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Better yet, if we time just the sum over the list itself (with the conversion done in setup):
In [136]: %%timeit alist=a.tolist()
...: sum(alist)
57.2 µs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
When working with numpy arrays, it is best to use numpy's own methods (or numpy functions). Conversely, when using plain Python functions, it is generally better to work with lists.
Using a numpy function on a list is slow, because it has to first convert the list to an array:
In [137]: %%timeit alist=a.tolist()
...: np.sum(alist)
795 µs ± 28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
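To see that conversion cost directly (an illustrative check, not part of the original answer), time just the list-to-array conversion; it should account for most of the np.sum(alist) time:
%%timeit alist=a.tolist()
np.array(alist)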

Fastest way to average sign-normalized segments of data with NumPy?

What would be the fastest way to collect segments of data from a NumPy array at every point in a dataset, normalize them based on the sign (+ve/-ve) at the start of the segment, and average all segments together?
At present I have:
import numpy as np
x0 = np.random.normal(0,1,5000) # Dataset to be analysed
l0 = 100 # Length of segment to be averaged
def average_seg(x,l):
    return np.mean([x[i:i+l]*np.sign(x[i]) for i in range(len(x)-l)],axis=0)
av_seg = average_seg(x0,l0)
Timing for this is as follows:
%timeit average_seg(x0,l0)
22.2 ms ± 362 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This does the job, but is there a faster way to do this?
The above code suffers when the length of x0 is large, and when the value of l0 is large. We're looking at looping through this code several million times, so even incremental improvements will help!
We can leverage 1D convolution -
np.convolve(x,np.sign(x[:-l+1][::-1]),'valid')/(len(x)-l+1)
The idea is to perform the windowed summations with a convolution, using a flipped kernel as per the convolution definition.
Timings -
In [150]: x = np.random.normal(0,1,5000) # Dataset to be analysed
...: l = 100 # Length of segment to be averaged
In [151]: %timeit average_seg(x,l)
17.2 ms ± 689 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [152]: %timeit np.convolve(x,np.sign(x[:-l+1][::-1]),'valid')/(len(x)-l+1)
149 µs ± 3.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [153]: av_seg = average_seg(x,l)
...: out = np.convolve(x,np.sign(x[:-l+1][::-1]),'valid')/(len(x)-l+1)
...: print(np.allclose(out, av_seg))
True
100x+ speedup!
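For reference, another option (not from the original answer, and assuming NumPy >= 1.20) is to build all the windows as strided views and average them; it removes the Python-level loop, at the cost of an intermediate (len(x)-l, l) product:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def average_seg_windows(x, l):
    # windows[i] is a view of x[i:i+l]; keep the same len(x)-l windows
    # as the original average_seg for an apples-to-apples comparison
    windows = sliding_window_view(x, l)[:len(x) - l]
    signs = np.sign(x[:len(x) - l])
    return (windows * signs[:, None]).mean(axis=0)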

How to vectorize custom algorithms in numpy or pytorch?

Suppose I have two matrices:
A: size k x m
B: size m x n
Using a custom operation, my output will be k x n.
This custom operation is not a dot product between the rows of A and columns of B. Suppose this custom operation is defined as:
For the i-th row of A and the j-th column of B, the (i, j) element of the output is:
sum( (A[i, :] + B[:, j]) ** 20 ), where the sum runs over the m shared entries.
The only way I can see to implement this is to expand this expression, calculate each term, then sum them.
Is there a way in numpy or pytorch to do this without expanding the equation?
Apart from the method @hpaulj outlines in the comments, you can also use the fact that what you are calculating is essentially a pair-wise Minkowski distance: since sum((a + b)**20) equals sum((a - (-b))**20), each output element is the order-20 Minkowski distance between a row of A and a negated column of B, raised to the 20th power:
import numpy as np
from scipy.spatial.distance import cdist
k,m,n = 10,20,30
A = np.random.random((k,m))
B = np.random.random((m,n))
method1 = ((A[...,None]+B)**20).sum(axis=1)
method2 = cdist(A,-B.T,'m',p=20)**20
np.allclose(method1,method2)
# True
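Since the question also mentions pytorch, the same broadcasting idea as method1 carries over directly (a sketch under that assumption, not from the original answers):
import torch

A_t = torch.rand(10, 20)   # k x m
B_t = torch.rand(20, 30)   # m x n
# (k, m, 1) + (m, n) broadcasts to (k, m, n); summing over dim=1 gives k x n
out = ((A_t.unsqueeze(-1) + B_t) ** 20).sum(dim=1)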
You can implement it yourself
The following function generates all kinds of dot-product-like functions, but don't use it to replace np.dot, because it will be quite a lot slower for larger arrays.
Template
import numpy as np
import numba as nb
from scipy.spatial.distance import cdist
def gen_dot_like_func(kernel,parallel=True):
    kernel_nb=nb.njit(kernel,fastmath=True)
    def cust_dot(A,B_in):
        B=np.ascontiguousarray(B_in.T)
        assert B.shape[1]==A.shape[1]
        out=np.empty((A.shape[0],B.shape[0]),dtype=A.dtype)
        for i in nb.prange(A.shape[0]):
            for j in range(B.shape[0]):
                sum=0
                for k in range(A.shape[1]):
                    sum+=kernel_nb(A[i,k],B[j,k])
                out[i,j]=sum
        return out
    if parallel==True:
        return nb.njit(cust_dot,fastmath=True,parallel=True)
    else:
        return nb.njit(cust_dot,fastmath=True,parallel=False)
Generate your function
#This can be useful if you have a lot of matrix-multiplication-like functions
my_func=gen_dot_like_func(lambda A,B:(A+B)**20,parallel=True)
Timings
k,m,n = 10,20,30
%timeit method1 = ((A[...,None]+B)**20).sum(axis=1)
192 µs ± 554 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit method2 = cdist(A,-B.T,'m',p=20)**20
208 µs ± 1.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit res=my_func(A,B) #parallel=False
4.01 µs ± 34.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
k,m,n = 500,100,500
%timeit method1 = ((A[...,None]+B)**20).sum(axis=1)
852 ms ± 4.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit method2 = cdist(A,-B.T,'m',p=20)**20
714 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res=my_func(A,B) #parallel=True
1.81 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
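A quick sanity check (illustrative) that the generated kernel agrees with the broadcasting reference on the same inputs:
A = np.random.random((10, 20))
B = np.random.random((20, 30))
print(np.allclose(my_func(A, B), ((A[..., None] + B) ** 20).sum(axis=1)))
# expected: True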

Improve Harmonic Mean efficiency in Pandas pivot_table

I'm applying hmean from scipy.stats as the aggfunc parameter in a Pandas pivot_table, but it is slower than a simple mean by orders of magnitude.
I would like to know whether this is expected behavior or whether there is a way to make this calculation more efficient, as I need to do it thousands of times.
I need to use the harmonic mean, but this is taking a huge amount of processing time.
I've tried using harmonic_mean from the statistics module in Python 3.6, but the overhead is about the same.
Thanks
import numpy as np
import pandas as pd
import statistics
from scipy.stats import hmean
data = pd.DataFrame({'value1':np.random.randint(1000,size=200000),
                     'value2':np.random.randint(24,size=200000),
                     'value3':np.random.rand(200000)+1,
                     'value4':np.random.randint(100000,size=200000)})
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=hmean)
1.74 s ± 24.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=lambda x: statistics.harmonic_mean(list(x)))
1.9 s ± 26.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=np.mean)
37.4 ms ± 938 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#Single run for both functions
%timeit hmean(data.value3[:100])
155 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.mean(data.value3[:100])
138 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I would recommend using multiprocessing.Pool. The code below has been tested on 20 million records and is about 3 times faster than the original; please give it a try. The code certainly still needs more improvements to address your specific question about the slow performance of statistics.harmonic_mean.
Note: you can get even better results for more than 100 million records.
import time
import numpy as np
import pandas as pd
import statistics
import multiprocessing
data = pd.DataFrame({'value1':np.random.randint(1000,size=20000000),
                     'value2':np.random.randint(24,size=20000000),
                     'value3':np.random.rand(20000000)+1,
                     'value4':np.random.randint(100000,size=20000000)})
def chunk_pivot(data):
    result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=lambda x: statistics.harmonic_mean(list(x)))
    return result
DataFrameDict=[]
for i in range(4):
    print(i*250,i*250+250)
    DataFrameDict.append(data[:][data.value1.between(i*250,i*250+249)])
def parallel_pivot(prcsr):
    # 6 is a number of processes I've tested
    p = multiprocessing.Pool(prcsr)
    out_df=[]
    for result in p.imap(chunk_pivot, DataFrameDict):
        #print (result)
        out_df.append(result)
    return out_df
start = time.time()
dict_pivot=parallel_pivot(6)
multiprocessing_result=pd.concat(dict_pivot,axis=0)
#singleprocessing_result = pd.pivot_table(data,index='value1',columns='value2',values='value3',aggfunc=lambda x: statistics.harmonic_mean(list(x)))
end = time.time()
print(end-start)
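As a further note (not from the original answer), the harmonic mean is the reciprocal of the arithmetic mean of reciprocals, so the fast built-in 'mean' aggregation can do the heavy lifting; a minimal sketch, assuming strictly positive values (value3 is in [1, 2) here):
import numpy as np
import pandas as pd

data = pd.DataFrame({'value1':np.random.randint(1000,size=200000),
                     'value2':np.random.randint(24,size=200000),
                     'value3':np.random.rand(200000)+1})

# harmonic_mean(x) == 1 / mean(1/x) for strictly positive x
recip = data.assign(value3=1.0/data['value3'])
result = 1.0/pd.pivot_table(recip,index='value1',columns='value2',
                            values='value3',aggfunc='mean')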

What's the fastest, most efficient, and pythonic way to perform a mathematical sigma sum?

Let's say that I want to perform a mathematical summation, say the Madhava–Leibniz formula for π, in Python:
Within a function called Leibniz_pi(), I could create a loop to calculate the nth partial sum, such as:
def Leibniz_pi(n):
    nth_partial_sum = 0 #initialize the variable
    for i in range(n+1):
        nth_partial_sum += ((-1)**i)/(2*i + 1)
    return nth_partial_sum
I'm assuming it would be faster to use something like xrange() instead of range(). Would it be even faster to use numpy and its built in numpy.sum() method? What would such an example look like?
I guess most people will call the solution by @zero using only numpy the most pythonic, but it is certainly not the fastest. With some additional optimizations you can beat the already fast numpy implementation by a factor of 50.
Using only Numpy (@zero)
import numpy as np
import numexpr as ne
import numba as nb
def Leibniz_point(n):
    val = (-1)**n / (2*n + 1)
    return val
%timeit Leibniz_point(np.arange(1000)).sum()
33.8 µs ± 203 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Make use of numexpr
n=np.arange(1000)
%timeit ne.evaluate("sum((-1)**n / (2*n + 1))")
21 µs ± 354 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Compile your function using Numba
# with error_model="numpy", turns off division-by-zero checks
@nb.njit(error_model="numpy",cache=True)
def Leibniz_pi(n):
    nth_partial_sum = 0. #initialize the variable as float64
    for i in range(n+1):
        nth_partial_sum += ((-1)**i)/(2*i + 1)
    return nth_partial_sum
%timeit Leibniz_pi(999)
6.48 µs ± 38.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Edit, optimizing away the costly (-1)**n
import numba as nb
import numpy as np
#replacement for the much more costly (-1)**n
@nb.njit()
def sgn(i):
    if i%2>0:
        return -1.
    else:
        return 1.
# with error_model="numpy", turns off the division-by-zero checks
#
# fastmath=True makes SIMD-vectorization in this case possible
# floating point addition is in general not associative
# e.g. calculating four times sgn(i)/(2*i + 1) at once and then the sum
# is not exactly the same as doing this sequentially, therefore you have to
# explicitly allow the compiler to make the optimizations
@nb.njit(fastmath=True,error_model="numpy",cache=True)
def Leibniz_pi(n):
    nth_partial_sum = 0. #initialize the variable
    for i in range(n+1):
        nth_partial_sum += sgn(i)/(2*i + 1)
    return nth_partial_sum
%timeit Leibniz_pi(999)
777 ns ± 5.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
3 suggestions (with speed computation):
First, define the Leibniz point rather than the cumulative sum:
def Leibniz_point(n):
    val = (-1)**n / (2*n + 1)
    return val
1) sum a list comprehension
%timeit sum([Leibniz_point(n) for n in range(100)])
58.8 µs ± 825 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit sum([Leibniz_point(n) for n in range(1000)])
667 µs ± 3.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2) standard for loop
%%timeit
sum = 0
for n in range(100):
sum += Leibniz_point(n)
61.8 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
sum = 0
for n in range(1000):
sum += Leibniz_point(n)
729 µs ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3) use a numpy array (suggested)
%timeit Leibniz_point(np.arange(100)).sum()
11.5 µs ± 866 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit Leibniz_point(np.arange(1000)).sum()
61.8 µs ± 3.69 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In general, for operations involving collections of more than a few elements, numpy will be faster. A simple numpy implementation could be something like this:
def leibniz(n):
    a = np.arange(n + 1)
    return (((-1.0) ** a) / (2 * a + 1)).sum()
Note that you must specify that the numerator is a float with 1.0 on Python 2. On Python 3, 1 will be fine.
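As a quick sanity check for any of these implementations, the partial sums multiplied by 4 should approach np.pi:
import numpy as np

print(4 * leibniz(10**6))   # ~3.141593, close to np.pi
print(np.pi)                # 3.141592653589793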
