I want to vectorize a for loop to speed things up. My code is the following:
import numpy as np
import pandas as pd
def my_func(array, n):
    return pd.Series(array).ewm(span=n, min_periods=n-1).mean().to_numpy()
np.random.seed(0)
data_size = 120000
data = np.random.uniform(0, 1000, size=data_size) + 29000
loop_size = 1000
step_size = 1
X = np.zeros([data.shape[0], loop_size])
parameter_array = np.arange(1, loop_size + step_size, step_size)
for i in parameter_array:
    X[:, i-1] = my_func(data, i)
The entire for-loop takes about a minute to finish, which could be a problem for future applications. I have already looked at numpy.vectorize(), but its documentation states clearly that it is provided for convenience only, so using it won't speed up the code by an order of magnitude.
My question is: is there a way to vectorize a for loop like this? If so, can I see a simple example of how it can be done?
Thank you in advance.
Related
I am trying to integrate over some matrix entries in Python. I want to avoid loops, because my task includes one million simulations. I am looking for an approach that will solve my problem efficiently.
I get the following error: only size-1 arrays can be converted to Python scalars
from scipy import integrate
import numpy.random as npr
n = 1000
m = 30
x = npr.standard_normal([n, m])
def integrand(k):
    return k * x ** 2
integrate.quad(integrand, 0, 100)
This is a simplified example of my case. I have multiple nested functions, so I cannot simply pull x in front of the integral.
Well, you might want to use parallel execution for this. It should be quite easy as long as you just want to execute integrate.quad 30,000,000 times: split your workload into little packages and give them to a thread pool. Of course the speedup is limited by the number of cores in your PC. I'm not a Python programmer, but this should be possible. You can also increase the epsabs and epsrel parameters of the quad function; depending on the implementation, this should speed up the program as well. Of course you'll get a less precise result, but that might be OK depending on your problem.
import threading
from scipy import integrate
import numpy.random as npr
n = 2
m = 3
x = npr.standard_normal([n, m])
def f(a):
    for j in range(m):
        integrand = lambda k: k * x[a, j]**2
        i = integrate.quad(integrand, 0, 100)
        print(i)  # write it to a result array instead
for i in range(n):
    # pass the function and its argument separately; target=f(i) would run f
    # immediately in the main thread instead of in the new thread
    threading.Thread(target=f, args=(i,)).start()
# better: split the work into more packages and give them to a thread pool
# to avoid the overhead of thread initialization
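A minimal sketch of the thread-pool variant that the comment alludes to, using the standard library's concurrent.futures (the per-row chunking here is my assumption, not part of the original answer; note also that Python's GIL can limit the gains for CPU-bound work):
from concurrent.futures import ThreadPoolExecutor
from scipy import integrate
import numpy.random as npr
n = 1000
m = 30
x = npr.standard_normal([n, m])
def integrate_row(a):
    # quad sees a scalar x[a, j], so each matrix entry is integrated separately
    return [integrate.quad(lambda k: k * x[a, j]**2, 0, 100)[0] for j in range(m)]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(integrate_row, range(n)))  # one task per row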
This is maybe not the ideal solution, but it should help a bit. You can use numpy.vectorize. Even the doc says: "The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop." But still, a %timeit on the simple example you provided shows a 2.3x speedup.
The implementation is:
from scipy import integrate
from numpy import vectorize
import numpy.random as npr
n = 1000
m = 30
x = npr.standard_normal([n,m])
def g(x):
    integrand = lambda k: k * x**2
    return integrate.quad(integrand, 0, 100)
vg = vectorize(g)
res = vg(x)
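Note that quad returns a (value, abserr) pair, so vectorize produces a pair of arrays here; a small unpacking sketch (my addition):
values, errors = res  # each array has shape (1000, 30)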
quadpy (a project of mine) does vectorized quadrature:
import numpy
import numpy.random as npr
import quadpy
x = npr.standard_normal([1000, 30])
def integrand(k):
    return numpy.multiply.outer(x ** 2, k)
scheme = quadpy.line_segment.gauss_legendre(10)
val = scheme.integrate(integrand, [0, 100])
This is much faster than all other answers.
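Since the integrand is linear in k, the result can be checked against the closed form: the integral of k * x**2 over k from 0 to 100 equals 5000 * x**2. A quick sanity check (my addition, assuming val comes back with the same shape as x ** 2):
assert numpy.allclose(val, 5000 * x ** 2)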
I am attempting to use the most time-efficient cumsum possible on a 3D array in Python. I have tried numpy's cumsum, but found that a manual parallelized method with numba was faster:
import numpy as np
from numba import njit, prange
from timeit import default_timer as timer
@njit(parallel=True)
def cpu_cumsum(data, output):
    for i in prange(200):
        for j in prange(2000000):
            output[i, j, 0] = data[i, j, 0]
    # these loops must start at 0, not 1, or row 0 and column 0 would be
    # left uninitialized beyond k=0
    for i in prange(200):
        for j in prange(2000000):
            for k in range(1, 5):
                output[i, j, k] = data[i, j, k] + output[i, j, k-1]
    return output
data = np.float32(np.arange(2000000000).reshape(200, 2000000, 5))
output = np.empty_like(data)
func_start = timer()
output = cpu_cumsum(data, output)
timing=timer()-func_start
print("Function: manualCumSum duration (seconds):" + str(timing))
My method:
Function: manualCumSum duration (seconds):2.8496341188924994
np.cumsum:
Function: cumSum duration (seconds):6.182090314569933
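For reference, the NumPy baseline being timed is presumably just the built-in cumulative sum along the last axis (an assumption on my part; the question does not show that call):
output = np.cumsum(data, axis=2)  # single call, but single-threaded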
While trying this with guvectorize, I found that it used too much memory for my GPU, so I have since abandoned that avenue. Is there a better way to do this, or have I hit the end of the road?
PS: Speed is needed because this loop runs many times.
I would like to develop a query system that finds the items most similar to a given one, based on a binary signature extracted from the data. I am probing for the most efficient way, since I have runtime constraints. I tried scipy's distance functions, but they were too slow. Do you know any other useful library or trick to do this in a faster manner?
As an example scenario:
I have a query vector of binary values with length 68, and a dataset matrix of size 3000K x 68. I would like to find the item in this matrix most similar to the given query, using the Hamming distance.
Thanks for any comments.
Nice problem; I liked the answers of Alex and Piotr. My first naive attempt also resulted in a solution time of around 800 ms (on my system). My second attempt, using numpy's (un)packbits, resulted in a 4x speed increase.
import numpy as np
LENGTH = 68
K = 1024
DATASIZE = 3000 * K
DATA = np.random.randint(0, 2, (DATASIZE, LENGTH)).astype(bool)
def RandomVect():
    return np.random.randint(0, 2, LENGTH).astype(bool)
def HammingDist(vec1, vec2):
    return np.sum(np.logical_xor(vec1, vec2))
def SmallestHamming(vec):
    XorData = np.logical_xor(DATA, vec[np.newaxis, :])
    Lengths = np.sum(XorData, axis=1)
    return DATA[np.argmin(Lengths)]  # returns first smallest
def main():
    v1 = RandomVect()
    v2 = SmallestHamming(v1)
    print(HammingDist(v1, v2))
# OK, let's try to make it faster... (using numpy.(un)packbits)
DATA2 = np.packbits(DATA, axis=1)
NBYTES = DATA2.shape[-1]
BYTE2ONES = np.zeros(256, dtype=np.uint8)
for i in range(256):
    # lookup table: number of set bits in each possible byte value
    BYTE2ONES[i] = np.sum(np.unpackbits(np.array([i], dtype=np.uint8)))
def RandomVect2():
    return np.packbits(RandomVect())
def HammingDist2(vec1, vec2):
    v1 = np.unpackbits(vec1)
    v2 = np.unpackbits(vec2)
    return np.sum(np.logical_xor(v1, v2))
def SmallestHamming2(vec):
    XorData = DATA2 ^ vec[np.newaxis, :]
    Lengths = np.sum(BYTE2ONES[XorData], axis=1)
    return DATA2[np.argmin(Lengths)]  # returns first smallest
def main2():
    v1 = RandomVect2()
    v2 = SmallestHamming2(v1)
    print(HammingDist2(v1, v2))
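A small driver to reproduce the comparison (the timeit usage is my addition, not part of the original answer):
import timeit
v = RandomVect()
vp = np.packbits(v)
t1 = timeit.timeit(lambda: SmallestHamming(v), number=3) / 3
t2 = timeit.timeit(lambda: SmallestHamming2(vp), number=3) / 3
print('naive: %.3f s, packed: %.3f s' % (t1, t2))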
Use cdist from SciPy:
from scipy.spatial.distance import cdist
Y = cdist(XA, XB, 'hamming')
Computes the normalized Hamming distance, or the proportion of those vector elements between two n-vectors u and v which disagree. To save memory, the matrix X can be of type boolean
Reference: http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
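A sketch of applying this to the question's setup (the shapes and names here are my assumptions): cdist expects 2-D inputs, so the query goes in as a 1 x 68 matrix, and the closest row is the argmin of the resulting distance row.
import numpy as np
from scipy.spatial.distance import cdist
data = np.random.randint(0, 2, (3000000, 68)).astype(bool)
query = np.random.randint(0, 2, (1, 68)).astype(bool)
dists = cdist(query, data, 'hamming')  # shape (1, 3000000), normalized distances
nearest = data[dists.argmin()]         # row with the smallest Hamming distance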
I would be surprised if there were a significantly faster way than this: put your data into a pandas DataFrame (M), one vector per column, and your target vector into a pandas Series (x):
import numpy as np
import pandas as pd
rows = 68
columns = 3000
M = pd.DataFrame(np.random.rand(rows, columns) > 0.5)
x = pd.Series(np.random.rand(rows) > 0.5)
then do the following
%timeit M.apply(lambda y: x==y).astype(int).sum().idxmax()
1 loop, best of 3: 746 ms per loop
Edit: Actually, I am surprised; this is a much faster way:
%timeit M.eq(x, axis=0).astype(int).sum().idxmax()
100 loops, best of 3: 2.68 ms per loop
Short Question
I have a large 10000x10000-element image, which I bin into a few hundred different sectors/bins. I then need to perform some iterative calculation on the values contained within each bin.
How do I extract the indices of each bin to efficiently perform my calculation using the bins values?
What I am looking for is a solution that avoids the bottleneck of selecting ind == j from my large array every time. Is there a way to obtain, in one go, the indices of the elements belonging to every bin?
Detailed Explanation
1. Straightforward Solution
One way to achieve what I need is to use code like the following (see e.g. THIS related answer), where I digitize my values and then have a j-loop selecting the digitized indices equal to j, as below:
import numpy as np
# This function func() is just a placeholder for a much more complicated function.
# I am aware that my problem could easily be sped up in the specific case of
# the sum() function, but I am looking for a general solution to the problem.
def func(x):
    y = np.sum(x)
    return y
vals = np.random.random(10**8)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
result = [func(vals[ind == j]) for j in range(1, nbins)]
This is not what I want, as it selects ind == j from my large array every time. This makes the solution very inefficient and slow.
2. Using binned_statistics
The above approach turns out to be the same as the one implemented in scipy.stats.binned_statistic for the general case of a user-defined function. Using Scipy directly, an identical output can be obtained with the following:
import numpy as np
from scipy.stats import binned_statistic
vals = np.random.random(10**8)
results = binned_statistic(vals, vals, statistic=func, bins=100, range=[0, 1])[0]
3. Using labeled_comprehension
Another Scipy alternative is to use scipy.ndimage.labeled_comprehension. Using that function, the above example would become:
import numpy as np
from scipy.ndimage import labeled_comprehension
vals = np.random.random(10**8)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
result = labeled_comprehension(vals, ind, np.arange(1, nbins), func, float, 0)
Unfortunately, this form is also inefficient, and in particular it has no speed advantage over my original example.
4. Comparison with IDL language
To further clarify, what I am looking for is functionality equivalent to the REVERSE_INDICES keyword of the HISTOGRAM function in the IDL language (HERE). Can this very useful functionality be efficiently replicated in Python?
Specifically, using the IDL language the above example could be written as
vals = randomu(s, 1e8)
nbins = 100
bins = [0:1:1./nbins]
h = histogram(vals, MIN=bins[0], MAX=bins[-2], NBINS=nbins, REVERSE_INDICES=r)
result = dblarr(nbins)
for j=0, nbins-1 do begin
    jbins = r[r[j]:r[j+1]-1]  ; Selects indices of bin j
    result[j] = func(vals[jbins])
endfor
The above IDL implementation is about 10 times faster than the NumPy one, because the indices of the bins do not have to be selected for every bin. And the speed difference in favour of the IDL implementation increases with the number of bins.
I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure, but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic, but can be orders of magnitude faster:
import numpy as np
from scipy.sparse import csr_matrix
def binned_statistic(x, values, func, nbins, range):
    '''The usage is nearly the same as scipy.stats.binned_statistic'''
    N = len(values)
    r0, r1 = range
    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
    S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))
    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]
I avoided np.digitize because it doesn't use the fact that all bins are equal width and hence is slow, but the method I used instead may not handle all edge cases perfectly.
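A usage sketch mirroring the question's example (my addition; func and vals are the ones defined in the question):
vals = np.random.random(10**8)
result = binned_statistic(vals, vals, func, nbins=100, range=(0, 1))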
I assume that the binning, done in the example with digitize, cannot be changed. This is one way to go, where you do the sorting once and for all.
import matplotlib.pyplot as plt
vals = np.random.random(10**4)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
new_order = np.argsort(ind)
ind = ind[new_order]
ordered_vals = vals[new_order]
# slower way of calculating first_hit (first version of this post):
# _, first_hit = np.unique(ind, return_index=True)
# faster way: find where each bin starts in the sorted index array
first_hit = np.searchsorted(ind, np.arange(1, nbins+1))
first_hit = np.append(first_hit, len(ind))  # close the last bin
# example of using the data:
for j in range(nbins):
    # a plotting function stands in for your func, to show that the bins cluster
    plt.plot(ordered_vals[first_hit[j]:first_hit[j+1]], 'o')
The figure shows that the bins are actually clusters as expected:
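To compute the question's result with func instead of plotting, the same boundaries can be reused (a sketch of mine):
result = [func(ordered_vals[first_hit[j]:first_hit[j+1]]) for j in range(nbins)]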
You can halve the computation time by sorting the array first and then using np.searchsorted.
vals = np.random.random(10**8)
vals.sort()
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
results = [func(vals[np.searchsorted(ind, j, side='left'):
                     np.searchsorted(ind, j, side='right')])
           for j in range(1, nbins)]
Using 10**8 elements as my test case, I go from 34 seconds of computation to about 17.
One efficient solution is using the numpy_indexed package (disclaimer: I am its author):
import numpy_indexed as npi
npi.group_by(ind).split(vals)
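To reproduce the question's result list, apply func to each group; group_by(ind).split(vals) returns one array per distinct value of ind (a small sketch of mine):
result = [func(group) for group in npi.group_by(ind).split(vals)]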
Pandas has very fast grouping code (I think it's written in C), so if you don't mind loading the library, you could do this:
import pandas as pd
pdata = pd.DataFrame({'vals': vals, 'ind': ind})
resultsp = pdata.groupby('ind').sum().values
or, more generally:
pdata = pd.DataFrame({'vals': vals, 'ind': ind})
resultsp = pdata.groupby('ind').agg(func).values
although the latter is slower than the built-in aggregation methods (like sum, mean, etc.) when func is one of those standard aggregations.
This question is probably answered somewhere, but I cannot find where, so I will ask here:
I have a set of data consisting of several samples per timestep. So I basically have two arrays: "times", which looks something like (0,0,0,1,1,1,1,1,2,2,3,4,4,4,4,...), and my data, which holds the value for each time. Each timestep has a random number of samples. I would like to get the average value of the data at each timestep in an efficient manner.
I have prepared the following sample code to show what my data looks like. Basically, I am wondering if there is a more efficient way to write the "average_values" function.
import numpy as np
import matplotlib.pyplot as plt
def average_values(x, y):
    unique_x = np.unique(x)
    averaged_y = [np.mean(y[x == ux]) for ux in unique_x]
    return unique_x, averaged_y
# generate our data
times = []
samples = []
# we have some timesteps:
for time in np.linspace(0, 10, 101):
    # and a random number of samples at each timestep:
    num_samples = np.random.randint(1, 11)  # random_integers(1, 10) is deprecated
    for i in range(num_samples):
        times.append(time)
        samples.append(np.sin(time) + np.random.random()*0.5)
times = np.array(times)
samples = np.array(samples)
plt.plot(times,samples,'bo',ms=3,mec=None,alpha=0.5)
plt.plot(*average_values(times,samples),color='r')
plt.show()
Here is what it looks like:
Generic code to do this would look as follows:
def average_values_bis(x, y):
    unq_x, idx = np.unique(x, return_inverse=True)
    count_x = np.bincount(idx)           # number of samples per unique time
    sum_y = np.bincount(idx, weights=y)  # sum of the samples per unique time
    return unq_x, sum_y / count_x
Adding the function above, and the following plotting line, to your script
plt.plot(*average_values_bis(times, samples), color='g')
produces this output, with the red line hidden behind the green one:
But timing both approaches reveals the benefits of using bincount, a 30x speed-up:
%timeit average_values(times, samples)
100 loops, best of 3: 2.83 ms per loop
%timeit average_values_bis(times, samples)
10000 loops, best of 3: 85.9 us per loop
May I propose a pandas solution? It is highly recommended if you are going to be working with time series.
Create test data
import pandas as pd
import numpy as np
times = np.random.randint(0, 10, size=50)
values = np.sin(times) + np.random.random_sample((len(times),))
s = pd.Series(values, index=times)
s.plot(linestyle='', marker='o')  # linestyle='.' is not a valid matplotlib style
Calculate averages
avs = s.groupby(level=0).mean()
avs.plot()
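To get plain arrays back in the same (times, averages) form as the question's average_values (a small sketch of mine):
unique_times = avs.index.to_numpy()
averages = avs.to_numpy()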