Can this next for-loop be done any faster by pushing it down to numpy?
ri = numpy.zeros((R.shape[0], R.shape[2]))
for i in range(R.shape[0]):
    ri[i, :] = R[i, indices[i], :]
This relates to my previous question (making numpy.nanargmin return nan if a column is all nan), which was about speeding up this bit:
bestepsilons = numpy.zeros((R.shape[0]))
for i in range(R.shape[0]):
    bestindex = numpy.nanargmin(R[i,:])
    if numpy.isnan(bestindex):
        bestepsilons[i] = numpy.nan
    else:
        bestepsilons[i] = epsilon[bestindex]
which I solved myself as:
bestepsilons1 = numpy.zeros(R.shape[0])+numpy.nan
d0 = numpy.nanmin(R, axis=1) # places where the best index is not a nan
bestepsilons1[~numpy.isnan(d0)] = epsilon[numpy.nanargmin(R[~numpy.isnan(d0),:], axis=1)]
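To make that trick concrete, here is a minimal sketch on a tiny hypothetical array with one all-NaN row (the data is invented for illustration, it is not from the question):
import numpy as np

R = np.array([[1.0, 2.0],
              [np.nan, np.nan],   # all-NaN row: nanargmin has no valid answer here
              [5.0, 0.5]])
epsilon = np.array([0.1, 0.2])

best = np.zeros(R.shape[0]) + np.nan
d0 = np.nanmin(R, axis=1)          # NaN exactly where a row is all NaN (warns, but works)
ok = ~np.isnan(d0)
best[ok] = epsilon[np.nanargmin(R[ok, :], axis=1)]
# best -> array([0.1, nan, 0.2])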
But now the more complicated case is:
bestepsilons = numpy.zeros((R.shape[0]))
for i in range(R.shape[0]):
    bestindex = numpy.nanargmin(R[i,indices[i],:])
    if numpy.isnan(bestindex):
        bestepsilons[i] = numpy.nan
    else:
        bestepsilons[i] = epsilon[bestindex]
And now the trick for finding the places where the best index is not a nan no longer works with that axis argument.
It is possible to push it down to numpy but whether or not it is faster will depend on the sizes of your arrays. Hopefully, there is a more elegant solution, but this works:
ii = np.arange(R.shape[0]) * R.shape[1] + indices
ri = R.reshape(-1, R.shape[2])[ii]
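A possible alternative (my addition, not from the original answer): on NumPy 1.15 or newer the same row-wise gather can be written with np.take_along_axis, so you don't have to compute flat indices by hand:
# R has shape (n, m, k) and indices has shape (n,); lift indices to 3D so it
# broadcasts along the last axis, pick along axis=1, then drop that axis.
ri = np.take_along_axis(R, indices[:, None, None], axis=1).squeeze(1)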
Here are a couple of timing tests:
def f1(R, indices):
    ri = numpy.zeros((R.shape[0], R.shape[2]))
    for i in range(R.shape[0]):
        ri[i, :] = R[i, indices[i], :]
    return ri

def f2(R, indices):
    ii = np.arange(R.shape[0]) * R.shape[1] + indices
    return R.reshape(-1, R.shape[2])[ii]
Smaller R:
In [25]: R = np.random.rand(30, 40, 50)
In [26]: indices = np.random.choice(range(R.shape[1]), R.shape[0], replace=True)
In [27]: %timeit(f1(R, indices))
10000 loops, best of 3: 61.4 us per loop
In [28]: %timeit(f2(R, indices))
10000 loops, best of 3: 21.9 us per loop
Larger R:
In [29]: R = np.random.rand(300, 400, 500)
In [30]: indices = np.random.choice(range(R.shape[1]), R.shape[0], replace=True)
In [31]: %timeit(f1(R, indices))
1000 loops, best of 3: 713 us per loop
In [32]: %timeit(f2(R, indices))
1000 loops, best of 3: 1.23 ms per loop
In [33]: np.all(f1(R, indices) == f2(R, indices))
Out[33]: True
I found that this is faster by about 10%:
d1 = numpy.arange(R.shape[0])[:,None]
d2 = indices[numpy.arange(R.shape[0])][:,None]
d3 = numpy.arange(R.shape[2])[None,:]
ri = R[d1,d2,d3]
bestepsilons = numpy.zeros(R.shape[0])+numpy.nan
d0 = numpy.nanmin(ri, axis=1) # places where the best index is not a nan
bestepsilons[~numpy.isnan(d0)] = epsilon[numpy.nanargmin(ri[~numpy.isnan(d0),:], axis=1)]
But this is with R defined as:
R = (self.VVm[:,None,None]-VVs[None,:,:])**2 + (self.HHm[:,None,None]-HHs[None,:,:])**2
and I found that defining R differently speeds things up much more:
ti = indices[numpy.arange(len(VVm))]
R1 = (VVm[:,None]-VVs[ti,:])**2+(HHm[:,None]-HHs[ti,:])**2
bestepsilons2 = numpy.zeros(R1.shape[0]) + numpy.nan
d0 = numpy.nanmin(R1, axis=1) # places where the best index is not a nan
bestepsilons2[~numpy.isnan(d0)] = epsilon[numpy.nanargmin(R1[~numpy.isnan(d0),:], axis=1)]
This way it does not need to build a 3D R at all, but builds it directly in 2D (which gives about a 4x speedup).
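For reference, here is a self-contained sketch of that 2D approach. The shapes of VVm, HHm, VVs, HHs, indices and epsilon are not stated in the question, so the sizes below are assumptions chosen just to make the snippet runnable:
import numpy as np

N, M, K = 200, 40, 50                      # assumed sizes
rng = np.random.default_rng(0)
VVm, HHm = rng.random(N), rng.random(N)    # measurements, shape (N,)
VVs, HHs = rng.random((M, K)), rng.random((M, K))
indices = rng.integers(0, M, N)            # one row of VVs/HHs per measurement
epsilon = rng.random(K)

# Build R directly in 2D instead of slicing a 3D array afterwards
R1 = (VVm[:, None] - VVs[indices, :])**2 + (HHm[:, None] - HHs[indices, :])**2

bestepsilons2 = np.full(N, np.nan)
d0 = np.nanmin(R1, axis=1)                 # NaN where a whole row is NaN
ok = ~np.isnan(d0)
bestepsilons2[ok] = epsilon[np.nanargmin(R1[ok, :], axis=1)]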
Related
I have thousands of pd.Series items, and I just want to add them up. They cover different time intervals, and I need to pad the missing values with zeros. I tried
add_series = lambda a, b: a.add(b, fill_value=0).fillna(0)
result = reduce(add_series, all_my_items)
which takes more time than I would expect. Is there any way to speed this up significantly?
Using concat
pd.concat(all_my_items,axis=1).fillna(0).sum(axis=1)
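A minimal illustration of what this does; the two toy series below are invented for this example and are not from the question:
import pandas as pd

s1 = pd.Series([1, 2, 3], index=pd.date_range('2020-01-01', periods=3))
s2 = pd.Series([10, 20], index=pd.date_range('2020-01-02', periods=2))

print(pd.concat([s1, s2], axis=1).fillna(0).sum(axis=1))
# 2020-01-01     1.0
# 2020-01-02    12.0
# 2020-01-03    23.0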
You can drop down to NumPy via np.pad and np.vstack. For performance, if possible you should avoid regular Python methods when manipulating Pandas / NumPy objects.
The below solution assumes each series is aligned by index, i.e. the kth item of each series by position is comparable across series for each k.
np.random.seed(0)
m, n = 10**2, 10**4
S = [pd.Series(np.random.random(np.random.randint(0, m))) for _ in range(n)]
def combiner(arrs):
    n = max(map(len, arrs))
    L = [np.pad(i.values, (0, n - len(i)), 'constant') for i in arrs]
    return np.vstack(L).sum(0)
res1 = pd.concat(S, axis=1).fillna(0).sum(axis=1)
res2 = pd.Series(combiner(S))
assert (res1 == res2).all()
%timeit pd.concat(S, axis=1).fillna(0).sum(axis=1)  # 2.63 s per loop
%timeit pd.Series(combiner(S)) # 863 ms per loop
You can use pd.concat with axis=0 and then groupby on level=0, such as:
pd.concat(all_my_items,axis=0).groupby(level=0).sum()
With all_my_items containing 1000 pd.Series of different lengths (e.g. between 2000 and 2500) and different time intervals such as:
import numpy as np
import pandas as pd

np.random.seed(0)
n = 1000  # number of series
# lengths of the series
len_ser = np.random.randint(2000, 2500, n)
# to pick a random start date
list_date = pd.date_range(start=pd.to_datetime('1980-01-01'), periods=15000).tolist()
# generate the list of pd.Series
all_my_items = [pd.Series(range(len_ser[i]),
                          index=pd.date_range(start=list_date[np.random.randint(0, 15000, 1)[0]],
                                              periods=len_ser[i]))
                for i in range(n)]
# Wen's solution
%timeit pd.concat(all_my_items,axis=1).fillna(0).sum(axis=1) #1.47 s ± 138 ms per loop
#this solution
%timeit pd.concat(all_my_items,axis=0).groupby(level=0).sum() #270 ms ± 11.3 ms
#verify same result
print((pd.concat(all_my_items, axis=1).fillna(0).sum(axis=1) ==
       pd.concat(all_my_items, axis=0).groupby(level=0).sum()).all())  # True
So the results are the same and the operation is faster.
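On a pair of tiny invented series, the same aggregation looks like this (missing dates simply contribute nothing to a group's sum, so no explicit zero-padding is needed):
import pandas as pd

s1 = pd.Series([1, 2, 3], index=pd.date_range('2020-01-01', periods=3))
s2 = pd.Series([10, 20], index=pd.date_range('2020-01-02', periods=2))

print(pd.concat([s1, s2], axis=0).groupby(level=0).sum())
# 2020-01-01     1
# 2020-01-02    12
# 2020-01-03    23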
Given the Fourier series coefficients a[n] and b[n] (for cosines and sines respectively) of a function with period T, and an equally spaced interval t, the following code will evaluate the partial sum at all points in t (a, b, t are all numpy arrays). Note that len(t) != len(a).
yn = ones(len(t)) * a[0]
for n in range(1, len(a)):
    yn = yn + (a[n]*cos(2*pi*n*t/T) - b[n]*sin(2*pi*n*t/T))
My question is: Can this for loop be vectorized?
Here's one vectorized approach that makes use of broadcasting to create the 2D array version of the cosine/sine input 2*pi*n*t/T, and then uses matrix-multiplication with np.dot for the sum-reduction -
r = np.arange(1,len(a))
S = 2*np.pi*r[:,None]*t/T
cS = np.cos(S)
sS = np.sin(S)
out = a[1:].dot(cS) - b[1:].dot(sS) + a[0]
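A tiny self-contained check of the shapes involved (the sizes here are made up purely for illustration):
import numpy as np

a = np.random.rand(5)            # coefficients a[0]..a[4]
b = np.random.rand(5)
t = np.linspace(0.0, 1.0, 7)     # 7 sample points
T = 1.0

r = np.arange(1, len(a))         # harmonics 1..4, shape (4,)
S = 2*np.pi*r[:, None]*t/T       # shape (4, 7): one row per harmonic
out = a[1:].dot(np.cos(S)) - b[1:].dot(np.sin(S)) + a[0]   # shape (7,), one value per t
assert out.shape == t.shape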
Further performance boost
For a further boost, we can make use of the numexpr module to compute those trigonometric steps -
import numexpr as ne
cS = ne.evaluate('cos(S)')
sS = ne.evaluate('sin(S)')
Runtime test -
Approaches -
def original_app(t, a, b, T):
    yn = np.ones(len(t)) * a[0]
    for n in range(1, len(a)):
        yn = yn + (a[n]*np.cos(2*np.pi*n*t/T) - b[n]*np.sin(2*np.pi*n*t/T))
    return yn

def vectorized_app(t, a, b, T):
    r = np.arange(1, len(a))
    S = (2*np.pi/T)*r[:, None]*t
    cS = np.cos(S)
    sS = np.sin(S)
    return a[1:].dot(cS) - b[1:].dot(sS) + a[0]

def vectorized_app_v2(t, a, b, T):
    r = np.arange(1, len(a))
    S = (2*np.pi/T)*r[:, None]*t
    cS = ne.evaluate('cos(S)')
    sS = ne.evaluate('sin(S)')
    return a[1:].dot(cS) - b[1:].dot(sS) + a[0]
Also including function PP from @Paul Panzer's post.
Timings -
In [22]: # Setup inputs
...: n = 10000
...: t = np.random.randint(0,9,(n))
...: a = np.random.randint(0,9,(n))
...: b = np.random.randint(0,9,(n))
...: T = 3.45
...:
In [23]: print np.allclose(original_app(t,a,b,T), vectorized_app(t,a,b,T))
...: print np.allclose(original_app(t,a,b,T), vectorized_app_v2(t,a,b,T))
...: print np.allclose(original_app(t,a,b,T), PP(t,a,b,T))
...:
True
True
True
In [25]: %timeit original_app(t,a,b,T)
...: %timeit vectorized_app(t,a,b,T)
...: %timeit vectorized_app_v2(t,a,b,T)
...: %timeit PP(t,a,b,T)
...:
1 loops, best of 3: 6.49 s per loop
1 loops, best of 3: 6.24 s per loop
1 loops, best of 3: 1.54 s per loop
1 loops, best of 3: 1.96 s per loop
Can't beat numexpr, but if it's not available we can save on the transcendentals (testing and benchmarking code heavily based on @Divakar's code, in case you didn't notice ;-) ):
import numpy as np
from timeit import timeit
def PP(t, a, b, T):
    CS = np.empty((len(t), len(a)-1), complex)
    CS[...] = np.exp(2j*np.pi*(t[:, None])/T)
    np.cumprod(CS, axis=-1, out=CS)
    return a[1:].dot(CS.T.real) - b[1:].dot(CS.T.imag) + a[0]
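# (Note added for clarity, not part of the original answer.) After the
# cumprod, column k of CS holds exp(2j*pi*(k+1)*t/T), so CS.real and CS.imag
# give the cos and sin terms for harmonics 1..len(a)-1; one complex multiply
# per harmonic replaces a separate cos and a sin evaluation.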
def original_app(t, a, b, T):
    yn = np.ones(len(t)) * a[0]
    for n in range(1, len(a)):
        yn = yn + (a[n]*np.cos(2*np.pi*n*t/T) - b[n]*np.sin(2*np.pi*n*t/T))
    return yn

def vectorized_app(t, a, b, T):
    r = np.arange(1, len(a))
    S = 2*np.pi*r[:, None]*t/T
    cS = np.cos(S)
    sS = np.sin(S)
    return a[1:].dot(cS) - b[1:].dot(sS) + a[0]
n = 1000
t = 2000
t = np.random.randint(0,9,(t))
a = np.random.randint(0,9,(n))
b = np.random.randint(0,9,(n))
T = 3.45
print(np.allclose(original_app(t,a,b,T), vectorized_app(t,a,b,T)))
print(np.allclose(original_app(t,a,b,T), PP(t,a,b,T)))
print('{:18s} {:9.6f}'.format('orig', timeit(lambda: original_app(t,a,b,T), number=10)/10))
print('{:18s} {:9.6f}'.format('Divakar no numexpr', timeit(lambda: vectorized_app(t,a,b,T), number=10)/10))
print('{:18s} {:9.6f}'.format('PP', timeit(lambda: PP(t,a,b,T), number=10)/10))
Prints:
True
True
orig 0.166903
Divakar no numexpr 0.179617
PP 0.060817
Btw., if delta t divides T one can potentially save even more, or even run the full FFT and discard the excess.
This is not really another answer but a comment on @Paul Panzer's one, written as an answer because I needed to post some code. If there is a way to post properly formatted code in a comment, please advise.
Inspired by #Paul Panzer cumprod idea, I came up with the following:
an = ones((len(a)-1,len(te)))*2j*pi*te/T
CS = exp(cumsum(an,axis=0))
out = (a[1:].dot(CS.real) - b[1:].dot(CS.imag)) + a[0]
Although it seems properly vectorized and produces correct results, its performance is miserable. Not only is it much slower than the cumprod version, which is expected since len(a)-1 more exponentiations are performed, it is also 50% slower than the original unvectorized version. What is the cause of this poor performance?
I have these 2 vectors A and B:
import numpy as np
A=np.array([1,2,3])
B=np.array([8,7])
and I want to add them up with this expression:
Result = sum((A-B)**2)
The expected result that I need is:
Result = np.array([X,Y])
Where:
X = (1-8)**2 + (2-8)**2 + (3-8)**2 = 110
Y = (1-7)**2 + (2-7)**2 + (3-7)**2 = 77
How can I do it? The two arrays here are just an example; in my case I have very large arrays and cannot do it manually.
You can make A a 2d array and utilize numpy's broadcasting property to vectorize the calculation:
((A[:, None] - B) ** 2).sum(0)
# array([110, 77])
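To see what the broadcasting is doing, here is the intermediate array for the vectors from the question:
import numpy as np

A = np.array([1, 2, 3])
B = np.array([8, 7])

diff = A[:, None] - B        # shape (3, 2): rows follow A, columns follow B
# array([[-7, -6],
#        [-6, -5],
#        [-5, -4]])
print((diff ** 2).sum(0))    # array([110,  77])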
Since you have mentioned that you are working with large arrays, here's one with a focus on performance using np.einsum, which does the combined operation of squaring and sum-reduction efficiently in one step, like so -
def einsum_based(A, B):
    subs = A[:, None] - B
    return np.einsum('ij,ij->j', subs, subs)
Sample run -
In [16]: A = np.array([1,2,3])
...: B = np.array([8,7])
...:
In [17]: einsum_based(A,B)
Out[17]: array([110, 77])
Runtime test with large arrays scaling up the given sample 1000x -
In [8]: A = np.random.rand(3000)
In [9]: B = np.random.rand(2000)
In [10]: %timeit ((A[:, None] - B) ** 2).sum(0) # #Psidom's soln
10 loops, best of 3: 21 ms per loop
In [11]: %timeit einsum_based(A,B)
100 loops, best of 3: 12.3 ms per loop
I've noticed that using len on a DataFrame is far quicker than using len on the underlying numpy array. I don't understand why. Accessing the same information via shape isn't any help either. This is more relevant as I try to get at the number of columns and number of rows. I was always debating which method to use.
I put together the following experiment and it's very clear that I will be using len on the dataframe. But can someone explain why?
from timeit import timeit
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

ns = np.power(10, np.arange(6))
results = pd.DataFrame(
    columns=ns,
    index=pd.MultiIndex.from_product(
        [['len', 'len(values)', 'shape'], ns]))

dfs = {(n, m): pd.DataFrame(np.zeros((n, m))) for n in ns for m in ns}
for n, m in dfs.keys():
    df = dfs[(n, m)]
    results.loc[('len', n), m] = timeit('len(df)', 'from __main__ import df', number=10000)
    results.loc[('len(values)', n), m] = timeit('len(df.values)', 'from __main__ import df', number=10000)
    results.loc[('shape', n), m] = timeit('df.values.shape', 'from __main__ import df', number=10000)

fig, axes = plt.subplots(2, 3, figsize=(9, 6), sharex=True, sharey=True)
for i, (m, col) in enumerate(results.iteritems()):
    r, c = i // 3, i % 3
    col.unstack(0).plot.bar(ax=axes[r, c], title=m)
From looking at the various methods, the main reason is that constructing the numpy array df.values takes the lion's share of the time.
len(df) and df.shape
These two are fast because they are essentially
len(df.index._data)
and
(len(df.index._data), len(df.columns._data))
where _data is a numpy.ndarray. Thus, using df.shape should be half as fast as len(df) because it's finding the length of both df.index and df.columns (both of type pd.Index).
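If you want to check this on your own machine, a quick IPython session along these lines should do; the exact numbers will of course vary:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((10_000, 100)))
%timeit len(df)    # length of the row index only
%timeit df.shape   # lengths of both the row index and the column index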
len(df.values) and df.values.shape
Let's say you had already extracted vals = df.values. Then
In [1]: df = pd.DataFrame(np.random.rand(1000, 10), columns=range(10))
In [2]: vals = df.values
In [3]: %timeit len(vals)
10000000 loops, best of 3: 35.4 ns per loop
In [4]: %timeit vals.shape
10000000 loops, best of 3: 51.7 ns per loop
Compared to:
In [5]: %timeit len(df.values)
100000 loops, best of 3: 3.55 µs per loop
So the bottleneck is not len but how df.values is constructed. If you examine pandas.DataFrame.values, you'll find the (roughly equivalent) methods:
def values(self):
    return self.as_matrix()

def as_matrix(self, columns=None):
    self._consolidate_inplace()
    if self._AXIS_REVERSED:
        return self._data.as_matrix(columns).T
    if len(self._data.blocks) == 0:
        return np.empty(self._data.shape, dtype=float)
    if columns is not None:
        mgr = self._data.reindex_axis(columns, axis=0)
    else:
        mgr = self._data
    if self._data._is_single_block or not self._data.is_mixed_type:
        return mgr.blocks[0].get_values()
    else:
        dtype = _interleaved_dtype(self.blocks)
        result = np.empty(self.shape, dtype=dtype)
        if result.shape[0] == 0:
            return result
        itemmask = np.zeros(self.shape[0])
        for blk in self.blocks:
            rl = blk.mgr_locs
            result[rl.indexer] = blk.get_values(dtype)
            itemmask[rl.indexer] = 1
        # vvv here is your final array assuming you actually have data
        return result

def _consolidate_inplace(self):
    def f():
        if self._data.is_consolidated():
            return self._data
        bm = self._data.__class__(self._data.blocks, self._data.axes)
        bm._is_consolidated = False
        bm._consolidate_inplace()
        return bm
    self._protect_consolidate(f)

def _protect_consolidate(self, f):
    blocks_before = len(self._data.blocks)
    result = f()
    if len(self._data.blocks) != blocks_before:
        if i is not None:
            self._item_cache.pop(i, None)
        else:
            self._item_cache.clear()
    return result
Note that df._data is a pandas.core.internals.BlockManager, not a numpy.ndarray.
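You can see that split between the internal block storage and the on-demand array yourself; note that _data is a private attribute (newer pandas versions expose it as _mgr, so this may differ across versions):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 2)))
print(type(df._data))    # a BlockManager, pandas' internal column-block store
print(type(df.values))   # numpy.ndarray, (re)built from the blocks on access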
If you look at __len__ for pd.DataFrame, it actually just calls len(df.index):
https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L770
For a RangeIndex, this is a really fast operation since it's just a subtraction and division of values stored within the index object:
return max(0, -(-(self._stop - self._start) // self._step))
https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/range.py#L458
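The -(-x // y) pattern used there is just ceiling division; a quick check with made-up values:
# ceil((stop - start) / step) via negated floor division (hypothetical values)
start, stop, step = 0, 10, 3
length = max(0, -(-(stop - start) // step))
assert length == len(range(start, stop, step)) == 4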
I suspect that if you tested with a non-RangeIndex, the difference in times would be much more similar. I'll probably try modifying what you have to see if that's the case.
EDIT: After a quick check, the speed difference still seems to hold even with a standard Index, so there must still be some other optimization there.
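One way to run that check (a hypothetical benchmark, numbers will vary) is to replace the RangeIndex with an ordinary object Index and time both calls again in IPython:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((10_000, 10)))
df.index = ['row%d' % i for i in range(len(df))]   # no longer a RangeIndex
%timeit len(df)          # still cheap: just the length of df.index
%timeit len(df.values)   # still pays the cost of building .values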
I am trying to avoid using for-loop with numpy array.
If I have a code that looks like below:
psimaps = [np.zeros((10, 10)) for i in range(len(features)-1)]
for k in range(len(features)-1):
    if k != len(features)-2:
        psimaps[k] = np.reshape(np.sum(featureParams*features[k], axis=1), (10, 1)) + transitiveParams
    else:
        psimaps[k] = np.reshape(np.sum(featureParams*features[k], axis=1), (10, 1)) + (np.sum(featureParams*features[k+1], axis=1)) + transitiveParams
return psimaps
How do I change this into a list representation to do this operation without a for loop? Thanks.
Edit: I added my original code above. Basically, I'm generating a new array computed from two other arrays.
Basically all you need to do is broadcast your features array to your Params arrays. This can be done by inserting two new axes at the end of features (or more, if the Params arrays are not 2d). Note that I used keepdims instead of the reshaping after the sum.
psimaps = np.sum(featureParams*features[..., None, None], axis=2, keepdims=True) + transitiveParams
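As a shape walk-through (using the same assumed sizes as the fake data further down: features of shape (50,) and 30x30 parameter arrays), this is what the broadcast does:
import numpy as np

size = 30
features = np.random.rand(50)
featureParams = np.random.rand(size, size)
transitiveParams = np.random.rand(size, size)

prod = featureParams * features[..., None, None]   # (50, 30, 30): one scaled copy per feature
summed = np.sum(prod, axis=2, keepdims=True)       # (50, 30, 1): row sums, axis kept for broadcasting
psimaps = summed + transitiveParams                # (50, 30, 30)
assert psimaps.shape == (len(features), size, size)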
After you do the above, you have to add the last two entries along the first axis together and then drop the last one, because of the special case at the end of your loop:
psimaps[-2] += psimaps[-1] - transitiveParams
psimaps = psimaps[:-1]
By the way, I first had to simplify your original loop before I could understand it. I'll leave my simplified versions here in case they're of interest:
Fake data (and my assumption of shapes)
size = 30
features = np.random.rand(50)
transitiveParams = np.random.rand(size, size)
featureParams = np.random.rand(size, size)
Original code by OP
psimaps_OP = [np.zeros((size, size)) for i in range(len(features)-1)]
for k in range(len(features)-1):
    if k != len(features)-2:
        psimaps_OP[k] = np.reshape(np.sum(featureParams*features[k], axis=1), (size, 1)) + transitiveParams
    else:
        psimaps_OP[k] = np.reshape(np.sum(featureParams*features[k], axis=1), (size, 1)) + (np.sum(featureParams*features[k+1], axis=1)) + transitiveParams
simplified:
psimaps_simp = np.zeros((len(features)-1, size, size))
for k in range(len(features)-1):
    psimaps_simp[k] = np.sum(featureParams*features[k], axis=1, keepdims=True) + transitiveParams
psimaps_simp[-1] += np.sum(featureParams*features[-1], axis=1)
list comp:
psimaps_comp = [np.sum(featureParams*features[k], axis=1, keepdims=True) + transitiveParams for k in xrange(len(features)-1)]
psimaps_comp[-1] += np.sum(featureParams*features[-1], axis=1)
vectorised:
psimaps_vec = np.sum(featureParams*features[..., None, None], axis=2, keepdims=True) + transitiveParams
psimaps_vec[-2] += psimaps_vec[-1] - transitiveParams
psimaps_vec = psimaps_vec[:-1]
Next, check to make sure they all give the same result:
assert np.allclose(psimaps_simp, psimaps_OP), "simplification failed"
assert np.allclose(psimaps_simp, psimaps_vec), "vectorization failed"
Finally, timings:
#OP
100 loops, best of 3: 1.99 ms per loop
#simplified:
1000 loops, best of 3: 1.94 ms per loop
#list comp:
1000 loops, best of 3: 1.63 ms per loop
#vectorised:
1000 loops, best of 3: 407 µs per loop
If initialization is not important, maybe you can do it like this:
psimaps = [ featureParams + transitiveParams for k in xrange(1,10)]
For each k, the sum featureParams + transitiveParams will be computed.