Adding many series efficiently? - python

I have thousands of pd.Series items, and I just want to add them. They cover different time intervals, and I need to pad missing values with zeros. I tried
from functools import reduce
add_series = lambda a, b: a.add(b, fill_value=0).fillna(0)
result = reduce(add_series, all_my_items)
which takes more time than I would expect. Is there any way to speed this up significantly?

Using concat
pd.concat(all_my_items,axis=1).fillna(0).sum(axis=1)
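As a quick illustration of why this works (my own toy example, not part of the original answer): concat with axis=1 aligns all the series on the union of their indexes, missing positions become NaN, and fillna(0) supplies the zero padding before the row-wise sum.
import pandas as pd
# two hypothetical series covering different index ranges
s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
s2 = pd.Series([10, 20], index=[1, 2])
# union of indexes -> NaN where a series has no value -> zero-filled -> summed
print(pd.concat([s1, s2], axis=1).fillna(0).sum(axis=1))
# 0     1.0
# 1    12.0
# 2    23.0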

You can drop down to NumPy via np.pad and np.vstack. For performance, you should avoid regular Python methods when manipulating Pandas / NumPy objects wherever possible.
The below solution assumes each series is aligned by index, i.e. the kth item of each series by position is comparable across series for each k.
np.random.seed(0)
m, n = 10**2, 10**4
S = [pd.Series(np.random.random(np.random.randint(0, m))) for _ in range(n)]
def combiner(arrs):
    n = max(map(len, arrs))
    # pad each series with trailing zeros to the common length, then stack and sum
    L = [np.pad(i.values, (0, n - len(i)), 'constant') for i in arrs]
    return np.vstack(L).sum(0)
res1 = pd.concat(S, axis=1).fillna(0).sum(axis=1)
res2 = pd.Series(combiner(S))
assert (res1 == res2).all()
%timeit pd.concat(S, axis=1).fillna(0).sum(axis=1)  # 2.63 s per loop
%timeit pd.Series(combiner(S))                      # 863 ms per loop

You can use pd.concat but with axis=0 and then groupby on level=0 such as:
pd.concat(all_my_items,axis=0).groupby(level=0).sum()
With all_my_items containing 1000 pd.Series of different lengths (e.g. between 2000 and 2500) and different time intervals such as:
import numpy as np
import pandas as pd
np.random.seed(0)
n = 1000  # number of series
# lengths of the series
len_ser = np.random.randint(2000, 2500, n)
# to pick a random start date
list_date = pd.date_range(start=pd.to_datetime('1980-01-01'), periods=15000).tolist()
# generate the list of pd.Series
all_my_items = [pd.Series(range(len_ser[i]),
                          index=pd.date_range(start=list_date[np.random.randint(0, 15000, 1)[0]],
                                              periods=len_ser[i]))
                for i in range(n)]
# Wen's solution
%timeit pd.concat(all_my_items,axis=1).fillna(0).sum(axis=1) #1.47 s ± 138 ms per loop
#this solution
%timeit pd.concat(all_my_items,axis=0).groupby(level=0).sum() #270 ms ± 11.3 ms
#verify same result
print((pd.concat(all_my_items, axis=1).fillna(0).sum(axis=1) ==
       pd.concat(all_my_items, axis=0).groupby(level=0).sum()).all())  # True
So the result is the same, and the operation is faster.
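To see why the two expressions agree, here is a minimal sketch of my own (toy data, not from the answer): concatenating along axis=0 keeps duplicate index labels, and groupby(level=0).sum() then adds every value sharing a label, which is exactly a.add(b, fill_value=0) applied across all the series.
import pandas as pd
s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
s2 = pd.Series([10, 20], index=[1, 2])
stacked = pd.concat([s1, s2], axis=0)      # index labels: 0, 1, 2, 1, 2
summed = stacked.groupby(level=0).sum()    # 0 -> 1, 1 -> 12, 2 -> 23
print((summed == s1.add(s2, fill_value=0)).all())  # True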

Related

Is there a faster method than np.isin and np.where for large arrays?

I have a 1xN array A and a 2xM array B. I want to make two new 1xN arrays
a boolean one that checks whether the first row of B is in A
another one with entries i that are B[1,i] if B[0,i] is in A, and np.nan otherwise
Whatever method I use needs to be super fast as it’ll be called a lot. I can do the first part using this: Is there method faster than np.isin for large array?
But I’m stumped on a good way to do the second part. Here’s what I’ve got so far (adapting the code in the post above):
import numpy as np
import numba as nb
@nb.jit(parallel=True)
def isinvals(arr, vals):
    n = len(arr)
    result = np.full(n, False)
    result_vals = np.full(n, np.nan)
    set_vals = set(vals[0, :])
    list_vals = list(vals[0, :])
    for i in nb.prange(n):
        if arr[i] in set_vals:
            ind = list_vals.index(arr[i])  ## THIS LINE IS WAY TOO SLOW
            result[i] = True
            result_vals[i] = vals[1, ind]
    return result, result_vals
N = int(1e5)
M = int(20e3)
num_arr = 100e3
num_vals = 20e3
num_types = 6
arr = np.random.randint(0, num_arr, N)
vals_col1 = np.random.randint(0, num_vals, M)
vals_col2 = np.random.randint(0, num_types, M)
vals = np.array([vals_col1, vals_col2])
%timeit result, result_vals = isinvals(arr,vals)
46.4 ms ± 3.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The line I've marked above (list_vals.index(arr[i])) is the slow part. If I don't use that I can make a super fast version:
@nb.jit(parallel=True)
def isinvals_cheating(arr, vals):
    n = len(arr)
    result = np.full(n, False)
    result_vals = np.full(n, np.nan)
    set_vals = set(vals[0, :])
    list_vals = list(vals[0, :])
    for i in nb.prange(n):
        if arr[i] in set_vals:
            ind = 0  ## TEMPORARILY SETTING TO 0 TO INDICATE SPEED DIFFERENCE
            result[i] = True
            result_vals[i] = vals[1, ind]
    return result, result_vals
%timeit result, result_vals = isinvals_cheating(arr,vals)
1.13 ms ± 59.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
i.e. that single line is making it 40 times slower.
Any ideas? I've also tried using np.where() but it's even slower.
Assuming the OP's solution gives the desired result, since the question is ambiguous for non-unique values in vals[0, idx] that have different corresponding values vals[1, idx]: a lookup table is faster, but it needs len(arr) additional space.
@nb.njit  # tested with numba 0.55.1
def isin_nb(arr, vals):
    lookup = np.empty(len(arr), np.float32)
    lookup.fill(np.nan)
    # reversed assignment so that, for duplicate keys, the first occurrence wins
    # (matching what list.index returns in the OP's version)
    lookup[vals[0, ::-1]] = vals[1, ::-1]
    res_val = lookup[arr]
    return ~np.isnan(res_val), res_val
With the example data used in the question
res, res_val = isin_nb(arr, vals)
# %timeit 1000 loops, best of 5: 294 µs per loop
Asserting equal results
np.testing.assert_equal(res, result)
np.testing.assert_equal(res_val, result_vals)
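One caveat worth spelling out (my addition, not part of the answer): the lookup table is indexed by the values themselves, so they must be non-negative integers smaller than the table length. That holds here because arr and vals[0] are drawn from ranges below 1e5 and len(arr) is 1e5; for an arbitrary integer range you would size the table to the maximum value instead, roughly:
# sketch assuming arr and vals[0] hold non-negative integers
size = int(max(arr.max(), vals[0].max())) + 1
lookup = np.full(size, np.nan, np.float32)
lookup[vals[0, ::-1]] = vals[1, ::-1]
res_val = lookup[arr]
res = ~np.isnan(res_val)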

Speed up distance and summary computation between two HUGE multi-dimensional arrays in python

I have only a year of experience using Python. I would like to find summary statistics based on two multi-dimensional arrays, DF_All and DF_On. Both have X, Y values. A function is created that computes distance as sqrt((X-X0)^2 + (Y-Y0)^2) and generates summaries as shown in the code below. My question is: is there any way to make this code run faster? I would prefer a native Python method, but other strategies (like numba) are also welcome.
The example (toy) code below takes only 50 milliseconds to run on my Windows 7 x64 desktop. But my DF_All has more than 10,000 rows, and I need to do this calculation a huge number of times as well, resulting in a huge execution time.
import numpy as np
import pandas as pd
import json, random
# create data
KY = ['ER','WD','DF']
DS = ['On','Off']
DF_All = pd.DataFrame({'KY': np.random.choice(KY, 20, replace=True),
                       'DS': np.random.choice(DS, 20, replace=True),
                       'X': random.sample(range(1, 100), 20),
                       'Y': random.sample(range(1, 100), 20)})
DF_On = DF_All[DF_All['DS'] == 'On']
# function
def get_values(DF_All, X=list(DF_On['X'])[0], Y=list(DF_On['Y'])[0]):
    dist_vector = np.sqrt((DF_All['X'] - X)**2 + (DF_All['Y'] - Y)**2)  # computes distance
    DF_All = DF_All[dist_vector < 35]  # filters if distance is < 35
    # print(DF_All.shape)
    DS_summary = [sum(DF_All['DS'] == x) for x in ['On', 'Off']]  # get summary
    KY_summary = [sum(DF_All['KY'] == x) for x in ['ER', 'WD', 'DF']]  # get summary
    joined_summary = DS_summary + KY_summary  # join two summary lists
    return joined_summary  # return
Array_On = DF_On.values.tolist() # convert to array then to list
Values = [get_values(DF_All,ZZ[2],ZZ[3]) for ZZ in Array_On] # list comprehension to get DS and KY summary for all rows of Array_On list
Array_Updated = [x + y for x,y in zip(Array_On,Values)] # appending the summary list to Array_On list
Array_Updated = pd.DataFrame(Array_Updated) # converting to pandas dataframe
print(Array_Updated)
Here's an approach making use of vectorization by getting rid of the looping there -
from scipy.spatial.distance import cdist
def get_values_vectorized(DF_All, Array_On):
    a = DF_All[['X','Y']].values
    b = np.array(Array_On)[:,2:].astype(int)
    v_mask = (cdist(b,a) < 35).astype(int)
    DF_DS = DF_All.DS.values
    DS_sums = v_mask.dot(DF_DS[:,None] == ['On','Off'])
    DF_KY = DF_All.KY.values
    KY_sums = v_mask.dot(DF_KY[:,None] == ['ER','WD','DF'])
    return np.column_stack(( DS_sums, KY_sums ))
Using a bit less memory, a tweaked one -
def get_values_vectorized_v2(DF_All, Array_On):
    a = DF_All[['X','Y']].values
    b = np.array(Array_On)[:,2:].astype(int)
    v_mask = cdist(a,b) < 35
    DF_DS = DF_All.DS.values
    DS_sums = [((DF_DS==x)[:,None] & v_mask).sum(0) for x in ['On','Off']]
    DF_KY = DF_All.KY.values
    KY_sums = [((DF_KY==x)[:,None] & v_mask).sum(0) for x in ['ER','WD','DF']]
    out = np.column_stack(( np.column_stack(DS_sums), np.column_stack(KY_sums)))
    return out
Runtime test -
Case #1 : Original sample size of 20
In [417]: %timeit [get_values(DF_All,ZZ[2],ZZ[3]) for ZZ in Array_On]
100 loops, best of 3: 16.3 ms per loop
In [418]: %timeit get_values_vectorized(DF_All, Array_On)
1000 loops, best of 3: 386 µs per loop
Case #2: Sample size of 2000
In [420]: %timeit [get_values(DF_All,ZZ[2],ZZ[3]) for ZZ in Array_On]
1 loops, best of 3: 1.39 s per loop
In [421]: %timeit get_values_vectorized(DF_All, Array_On)
100 loops, best of 3: 18 ms per loop
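A usage sketch of my own (reusing the names from the question) showing how the vectorized output can take the place of the original list comprehension when building Array_Updated:
# one summary row per row of Array_On, in the same order
Values = get_values_vectorized(DF_All, Array_On)
Array_Updated = pd.DataFrame([row + list(vals) for row, vals in zip(Array_On, Values)])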

why is len so much more efficient on DataFrame than on underlying numpy array?

I've noticed that using len on a DataFrame is far quicker than using len on the underlying numpy array. I don't understand why. Accessing the same information via shape isn't any help either. This is more relevant as I try to get at the number of columns and number of rows. I was always debating which method to use.
I put together the following experiment and it's very clear that I will be using len on the dataframe. But can someone explain why?
from timeit import timeit
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
ns = np.power(10, np.arange(6))
results = pd.DataFrame(
    columns=ns,
    index=pd.MultiIndex.from_product(
        [['len', 'len(values)', 'shape'],
         ns]))
dfs = {(n, m): pd.DataFrame(np.zeros((n, m))) for n in ns for m in ns}
for n, m in dfs.keys():
    df = dfs[(n, m)]
    results.loc[('len', n), m] = timeit('len(df)', 'from __main__ import df', number=10000)
    results.loc[('len(values)', n), m] = timeit('len(df.values)', 'from __main__ import df', number=10000)
    results.loc[('shape', n), m] = timeit('df.values.shape', 'from __main__ import df', number=10000)
fig, axes = plt.subplots(2, 3, figsize=(9, 6), sharex=True, sharey=True)
for i, (m, col) in enumerate(results.iteritems()):
    r, c = i // 3, i % 3
    col.unstack(0).plot.bar(ax=axes[r, c], title=m)
From looking at the various methods, the main reason is that constructing the numpy array df.values takes the lion's share of the time.
len(df) and df.shape
These two are fast because they are essentially
len(df.index._data)
and
(len(df.index._data), len(df.columns._data))
where _data is a numpy.ndarray. Thus, using df.shape should be half as fast as len(df) because it's finding the length of both df.index and df.columns (both of type pd.Index)
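A quick way to sanity-check that claim (a sketch of mine; the exact ratio will vary by machine and pandas version):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((10**6, 10)))
%timeit len(df)    # just the length of df.index
%timeit df.shape   # lengths of df.index and df.columns, so roughly twice the work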
len(df.values) and df.values.shape
Let's say you had already extracted vals = df.values. Then
In [1]: df = pd.DataFrame(np.random.rand(1000, 10), columns=range(10))
In [2]: vals = df.values
In [3]: %timeit len(vals)
10000000 loops, best of 3: 35.4 ns per loop
In [4]: %timeit vals.shape
10000000 loops, best of 3: 51.7 ns per loop
Compared to:
In [5]: %timeit len(df.values)
100000 loops, best of 3: 3.55 µs per loop
So the bottleneck is not len but how df.values is constructed. If you examine pandas.DataFrame.values, you'll find the (roughly equivalent) methods:
def values(self):
    return self.as_matrix()

def as_matrix(self, columns=None):
    self._consolidate_inplace()
    if self._AXIS_REVERSED:
        return self._data.as_matrix(columns).T
    if len(self._data.blocks) == 0:
        return np.empty(self._data.shape, dtype=float)
    if columns is not None:
        mgr = self._data.reindex_axis(columns, axis=0)
    else:
        mgr = self._data
    if self._data._is_single_block or not self._data.is_mixed_type:
        return mgr.blocks[0].get_values()
    else:
        dtype = _interleaved_dtype(self.blocks)
        result = np.empty(self.shape, dtype=dtype)
        if result.shape[0] == 0:
            return result
        itemmask = np.zeros(self.shape[0])
        for blk in self.blocks:
            rl = blk.mgr_locs
            result[rl.indexer] = blk.get_values(dtype)
            itemmask[rl.indexer] = 1
        # vvv here is your final array assuming you actually have data
        return result

def _consolidate_inplace(self):
    def f():
        if self._data.is_consolidated():
            return self._data
        bm = self._data.__class__(self._data.blocks, self._data.axes)
        bm._is_consolidated = False
        bm._consolidate_inplace()
        return bm
    self._protect_consolidate(f)

def _protect_consolidate(self, f):
    blocks_before = len(self._data.blocks)
    result = f()
    if len(self._data.blocks) != blocks_before:
        if i is not None:
            self._item_cache.pop(i, None)
        else:
            self._item_cache.clear()
    return result
Note that df._data is a pandas.core.internals.BlockManager, not a numpy.ndarray.
If you look at __len__ for pd.DataFrame, it actually just calls len(df.index):
https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L770
For a RangeIndex, this is a really fast operation since it's just a subtraction and division of values stored within the index object:
return max(0, -(-(self._stop - self._start) // self._step))
https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/range.py#L458
I suspect that if you tested with a non-RangeIndex, the difference in times would be much more similar. I'll probably try modifying what you have to see if that's the case.
EDIT: After a quick check, the speed difference still seems to hold even with a standard Index, so there must still be some other optimization there.
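For anyone who wants to reproduce that quick check, here is a sketch of mine (not the answerer's actual code) that swaps the RangeIndex for a plain object Index:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((10**6, 10)))
df.index = df.index.astype(str)  # replace the RangeIndex with an ordinary object Index
%timeit len(df)          # still cheap: length of the index's underlying array
%timeit len(df.values)   # still dominated by materializing df.values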

Fastest way for a loop and condition code (Python + Dataframes)

I have the following loop, which takes more than 9 seconds for 10 000 iterations. For my program, I have to execute this function more than 1000 times. I need some help optimizing the simu function, because as it stands my code is unusable due to its running time. For info, the daterange values are only an example and can differ a lot from one case to another.
What takes most of the time:
df.itertuples(['DATES'])
the loop, even using an iterator
the if condition
df.index.get_loc to get the position of the date
Does anyone have an idea how to optimize this code?
def simu(nbprod, df, daterange):
    timer = time.time()
    mat = np.zeros((len(df), nbprod))
    iterator = ((i,j) for j in xrange(len(daterange)) for i in df.itertuples(['DATES']))
    for (i,j) in iterator:
        thedate = i[0]
        if (thedate >= daterange[j][0]) and (thedate <= daterange[j][1]):
            mat[df.index.get_loc(i[0])][j] = 1
    print time.time() - timer
    return mat
new_index = pd.date_range(start=pd.datetime(2014,1,1), periods=24*10000, freq='H')
df = pd.DataFrame(np.random.randn(len(new_index)), new_index)
df.index.name = 'DATES'
daterange = [[pd.datetime(2014,1,3), pd.datetime(2014,1,7)], [pd.datetime(2015,6,3), pd.datetime(2017,1,7)], [pd.datetime(2017,1,3), pd.datetime(2020,1,7)]]
### for 1 time
>>> simu(len(daterange), df, daterange)
9.43400001526
### for 3 times more
>>> simu(len(daterange)*3, df, daterange*3)
30.6919999123
>>> simu(len(daterange)*10, df, daterange*10)
92.2009999752
This returns a frame, which is IMHO more useful anyhow (if you want the underlying data, just use df.values). This will scale linearly with the length of daterange.
def simu2(df, daterange):
    mat = pd.DataFrame(0, index=df.index, columns=range(len(daterange)))
    for j, (d1, d2) in enumerate(daterange):
        result = df[(df.index >= d1) & (df.index <= d2)]
        mat.loc[result.index, j] = 1
    return mat
In [7]: result1 = simu2(df, daterange)
In [10]: result2 = simu(len(daterange), df, daterange)
5.7844748497
In [11]: (result1.values==result2).all()
Out[11]: True
In [12]: %timeit simu2(df, daterange)
10 loops, best of 3: 162 ms per loop
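For completeness, a broadcast-based sketch of my own (simu3 is a hypothetical name, not part of the answer) that also removes the Python loop over daterange; it trades len(df) x len(daterange) booleans of memory for doing every comparison in one shot:
import numpy as np
import pandas as pd
def simu3(df, daterange):
    dates = df.index.values[:, None]                                     # shape (len(df), 1)
    starts = np.array([d1 for d1, d2 in daterange], dtype='datetime64[ns]')
    ends = np.array([d2 for d1, d2 in daterange], dtype='datetime64[ns]')
    mask = (dates >= starts) & (dates <= ends)                           # shape (len(df), len(daterange))
    return pd.DataFrame(mask.astype(int), index=df.index)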

A fast way to find the largest N elements in an numpy array

I know I can do it like the following:
import numpy as np
N = 10
a = np.arange(1, 100, 1)
np.argsort(a)[-N:]
However, it is very slow since it does a full sort.
I wonder whether numpy provides some method to do it fast.
numpy 1.8 implements partition and argpartition, which perform a partial sort (in O(n) time, as opposed to a full sort, which is O(n log n)).
import numpy as np
test = np.array([9,1,3,4,8,7,2,5,6,0])
temp = np.argpartition(-test, 4)
result_args = temp[:4]
temp = np.partition(-test, 4)
result = -temp[:4]
Result:
>>> result_args
array([0, 4, 8, 5]) # indices of highest vals
>>> result
array([9, 8, 6, 7]) # highest vals
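Note that argpartition makes no ordering promise inside the top block, which is why the values above come out as 9, 8, 6, 7. If the largest N are also wanted in sorted order, one common follow-up (my addition, not from the answer) is to sort just that small slice:
idx = np.argpartition(-test, 4)[:4]              # indices of the 4 largest, in no particular order
top_sorted = test[idx[np.argsort(-test[idx])]]   # array([9, 8, 7, 6])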
Timing:
In [16]: a = np.arange(10000)
In [17]: np.random.shuffle(a)
In [18]: %timeit np.argsort(a)
1000 loops, best of 3: 1.02 ms per loop
In [19]: %timeit np.argpartition(a, 100)
10000 loops, best of 3: 139 us per loop
In [20]: %timeit np.argpartition(a, 1000)
10000 loops, best of 3: 141 us per loop
The bottleneck module has a fast partial sort method that works directly with Numpy arrays: bottleneck.partition().
Note that bottleneck.partition() returns the actual values (partitioned, not fully sorted); if you want the indexes of those values (what numpy.argsort() returns), you should use bottleneck.argpartition().
I've benchmarked:
z = -bottleneck.partition(-a, 10)[:10]
z = a.argsort()[-10:]
z = heapq.nlargest(10, a)
where a is a random 1,000,000-element array.
The timings were as follows:
bottleneck.partition(): 25.6 ms per loop
np.argsort(): 198 ms per loop
heapq.nlargest(): 358 ms per loop
I had this problem and, since this question is 5 years old, I had to redo all benchmarks and change the syntax of bottleneck (there is no partsort anymore, it's partition now).
I used the same arguments as kwgoodman, except the number of elements retrieved, which I increased to 50 (to better fit my particular situation).
I got these results:
bottleneck 1: 01.12 ms per loop
bottleneck 2: 00.95 ms per loop
pandas : 01.65 ms per loop
heapq : 08.61 ms per loop
numpy : 12.37 ms per loop
numpy 2 : 00.95 ms per loop
So, bottleneck_2 and numpy_2 (adas's solution) were tied.
But, using np.percentile (numpy_2) you have those topN elements already sorted, which is not the case for the other solutions. On the other hand, if you are also interested in the indexes of those elements, percentile is not useful.
I added pandas too, which uses bottleneck underneath, if available (http://pandas.pydata.org/pandas-docs/stable/install.html#recommended-dependencies). If you already have a pandas Series or DataFrame to start with, you are in good hands; just use nlargest and you're done.
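A tiny illustration of the nlargest route (my example, not from the answer): conveniently, the result keeps the original positions as the Series index, so it also addresses the "indexes of those elements" point above.
import numpy as np
import pandas as pd
a = np.array([1, -5, 4, 6, -3, 3])
print(pd.Series(a).nlargest(3))
# 3    6
# 2    4
# 5    3   (original position, value; dtype line omitted)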
The code used for the benchmark is as follows (python 3, please):
import time
import numpy as np
import bottleneck as bn
import pandas as pd
import heapq
def bottleneck_1(a, n):
    return -bn.partition(-a, n)[:n]
def bottleneck_2(a, n):
    return bn.partition(a, a.size-n)[-n:]
def numpy(a, n):
    return a[a.argsort()[-n:]]
def numpy_2(a, n):
    M = a.shape[0]
    perc = (np.arange(M-n, M) + 1.0) / M * 100
    return np.percentile(a, perc)
def pandas(a, n):
    return pd.Series(a).nlargest(n)
def hpq(a, n):
    return heapq.nlargest(n, a)
def do_nothing(a, n):
    return a[:n]
def benchmark(func, size=1000000, ntimes=100, topn=50):
    t1 = time.time()
    for n in range(ntimes):
        a = np.random.rand(size)
        func(a, topn)
    t2 = time.time()
    ms_per_loop = 1000000 * (t2 - t1) / size
    return ms_per_loop
t1 = benchmark(bottleneck_1)
t2 = benchmark(bottleneck_2)
t3 = benchmark(pandas)
t4 = benchmark(hpq)
t5 = benchmark(numpy)
t6 = benchmark(numpy_2)
t0 = benchmark(do_nothing)
print("bottleneck 1: {:05.2f} ms per loop".format(t1 - t0))
print("bottleneck 2: {:05.2f} ms per loop".format(t2 - t0))
print("pandas : {:05.2f} ms per loop".format(t3 - t0))
print("heapq : {:05.2f} ms per loop".format(t4 - t0))
print("numpy : {:05.2f} ms per loop".format(t5 - t0))
print("numpy 2 : {:05.2f} ms per loop".format(t6 - t0))
Each negative sign in the proposed bottleneck solution
-bottleneck.partsort(-a, 10)[:10]
makes a copy of the data. We can remove the copies by doing
bottleneck.partsort(a, a.size-10)[-10:]
Also the proposed numpy solution
a.argsort()[-10:]
returns indices not values. The fix is to use the indices to find the values:
a[a.argsort()[-10:]]
The relative speed of the two bottleneck solutions depends on the ordering of the elements in the initial array because the two approaches partition the data at different points.
In other words, timing with any one particular random array can make either method look faster.
Averaging the timing across 100 random arrays, each with 1,000,000 elements, gives
-bn.partsort(-a, 10)[:10]: 1.76 ms per loop
bn.partsort(a, a.size-10)[-10:]: 0.92 ms per loop
a[a.argsort()[-10:]]: 15.34 ms per loop
where the timing code is as follows:
import time
import numpy as np
import bottleneck as bn
def bottleneck_1(a):
    return -bn.partsort(-a, 10)[:10]
def bottleneck_2(a):
    return bn.partsort(a, a.size-10)[-10:]
def numpy(a):
    return a[a.argsort()[-10:]]
def do_nothing(a):
    return a
def benchmark(func, size=1000000, ntimes=100):
    t1 = time.time()
    for n in range(ntimes):
        a = np.random.rand(size)
        func(a)
    t2 = time.time()
    ms_per_loop = 1000000 * (t2 - t1) / size
    return ms_per_loop
t1 = benchmark(bottleneck_1)
t2 = benchmark(bottleneck_2)
t3 = benchmark(numpy)
t4 = benchmark(do_nothing)
print "-bn.partsort(-a, 10)[:10]: %0.2f ms per loop" % (t1 - t4)
print "bn.partsort(a, a.size-10)[-10:]: %0.2f ms per loop" % (t2 - t4)
print "a[a.argsort()[-10:]]: %0.2f ms per loop" % (t3 - t4)
Perhaps heapq.nlargest
import numpy as np
import heapq
x = np.array([1,-5,4,6,-3,3])
z = heapq.nlargest(3,x)
Result:
>>> z
[6, 4, 3]
If you want to find the indices of the n largest elements using bottleneck you could use
bottleneck.argpartsort
>>> x = np.array([1,-5,4,6,-3,3])
>>> z = bottleneck.argpartsort(-x, 3)[:3]
>>> z
array([3, 2, 5])
You can also use numpy's percentile function. In my case it was slightly faster than bottleneck.partsort():
import timeit
import numpy as np
import bottleneck as bn
N, M, K = 10, 1000000, 100
start = timeit.default_timer()
for k in range(K):
    a = np.random.uniform(size=M)
    tmp = -bn.partsort(-a, N)[:N]
stop = timeit.default_timer()
print (stop - start)/K
start = timeit.default_timer()
perc = (np.arange(M-N, M) + 1.0) / M * 100
for k in range(K):
    a = np.random.uniform(size=M)
    tmp = np.percentile(a, perc)
stop = timeit.default_timer()
print (stop - start)/K
Average time per loop:
bottleneck.partsort(): 59 ms
np.percentile(): 54 ms
If storing the array as a list of numbers isn't problematic, you can use
import heapq
heapq.nlargest(N, a)
to get the N largest members.
