How can I optimize/vectorize this looped assignment on a DataFrame? - python

Below is a function I wrote to label certain rows based on ranges of indexes. For convenience, I'm making the two function arguments, samples and matdat, available for download in pickle format.
from operator import itemgetter
from itertools import izip, imap
import pandas as pd
def _insert_design_columns(samples, matdat):
    """Add columns for design factors, label lines that correspond to a given trial and
    then fill in said columns with the appropriate value on lines that belong to a
    trial.

    samples : DataFrame
        DataFrame of eyetracker samples.
        column `t`: time sample, in ms
        column `event`: TTL event
        columns `x`, `y`: x and y coordinates of gaze
        column `cr`: corneal reflection area
    matdat : dict of numpy arrays
        dict mapping matlab variable name to numpy array

    returns : modified `samples` dataframe
    """
    ## This is fairly trivial preparation and data formatting for the nested
    # for-loop below. We're just fixing types, adding empty columns, and
    # ensuring that our numpy arrays have the right shape.
    # Grab variables from the dict & squeeze the numpy arrays
    key = ('cuepos', 'targetpos', 'targetorientation', 'soa', 'normalizedResp')
    cpos, tpos, torient, soa, resp = map(pd.np.squeeze, imap(matdat.get, key))
    cpos = cpos.astype(float)
    cpos[cpos < 0] = pd.np.nan
    cong = tpos == cpos
    cong[pd.isnull(cpos)] = pd.np.nan
    # Add empty columns for each factor. These will contain the factor level
    # for the rows that correspond to a trial (i.e. between a `TrialStart` and
    # a `ReportCueOnset` in `samples.event`).
    samples['soa'] = pd.np.nan
    samples['cpos'] = pd.np.nan
    samples['tpos'] = pd.np.nan
    samples['cong'] = pd.np.nan
    samples['torient'] = pd.np.nan
    samples['normalizedResp'] = pd.np.nan
    ## This is important, but not the part we need to optimize.
    # Here, we're finding the start and end indexes for every trial. Trials
    # are composed of contiguous slices of rows.
    # Assign trial numbers
    tstart = samples[samples.event == 'TrialStart'].t       # each trial starts on a `TrialStart`
    tstop = samples[samples.event == 'ReportCueOnset'].t    # ... and ends on a `ReportCueOnset`
    samples['trial'] = pd.np.nan  # make an empty column which will contain the trial number
    ## This is the sub-optimal part. Here, we're iterating through our start/end index
    # pairs, slicing the dataframe to get the rows we need, and then:
    # 1. Assigning a trial number to that slice of rows
    # 2. Assigning the correct value to corresponding columns (see `factor_names`)
    samples.set_index(['t'], inplace=True)
    for i, (start, stop) in enumerate(izip(tstart, tstop)):
        samples.loc[start:stop, 'trial'] = i + 1  # label the interval's trial number
        # Now that we've labeled a range of rows as a trial, we can add factor levels
        # to the corresponding columns
        idx = itemgetter(i - 1)
        # factor_values/names has the same length as the number of trials we're going to
        # find. Get the corresponding value for the current trial so that we can
        # assign it.
        factor_values = imap(idx, (cpos, tpos, torient, soa, resp, cong))
        factor_names = ('cpos', 'tpos', 'torient', 'soa', 'resp', 'cong')
        for c, v in izip(factor_names, factor_values):  # loop through columns and assign
            samples.loc[start:stop, c] = v
    samples.reset_index(inplace=True)
    return samples
I've performed a %prun, the first few lines of which read:
548568 function calls (547462 primitive calls) in 9.380 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
11360 6.074 0.001 6.084 0.001 index.py:604(__contains__)
2194 0.949 0.000 0.949 0.000 {method 'copy' of 'numpy.ndarray' objects}
1430 0.730 0.001 0.730 0.001 {pandas.lib.infer_dtype}
1098 0.464 0.000 0.467 0.000 internals.py:277(set)
1093/1092 0.142 0.000 9.162 0.008 indexing.py:157(_setitem_with_indexer)
1100 0.106 0.000 1.266 0.001 frame.py:1851(__setitem__)
166 0.047 0.000 0.047 0.000 {method 'astype' of 'numpy.ndarray' objects}
107209 0.037 0.000 0.066 0.000 {isinstance}
14 0.029 0.002 0.029 0.002 {numpy.core.multiarray.concatenate}
39362/38266 0.026 0.000 6.101 0.000 {getattr}
7829/7828 0.024 0.000 0.030 0.000 {numpy.core.multiarray.array}
1092 0.023 0.000 0.457 0.000 internals.py:564(setitem)
5 0.023 0.005 0.023 0.005 {pandas.algos.take_2d_axis0_float64_float64}
4379 0.021 0.000 0.108 0.000 index.py:615(__getitem__)
1101 0.020 0.000 0.582 0.001 frame.py:1967(_sanitize_column)
2192 0.017 0.000 0.946 0.000 internals.py:2236(apply)
8 0.017 0.002 0.017 0.002 {method 'repeat' of 'numpy.ndarray' objects}
Judging by the line that reads 1093/1092 0.142 0.000 9.162 0.008 indexing.py:157(_setitem_with_indexer), I strongly suspect my nested loop assignment with loc to be the culprit. The whole function takes about 9.3 seconds to execute and has to be performed 144 times in total (i.e. ~22 minutes).
Is there a way to vectorize or otherwise optimize the assignment I'm trying to do?
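For what it's worth, one possible way to vectorize this (a sketch, not a tested drop-in replacement): map every row to its trial with a single np.searchsorted pass, then write each factor column once with an array assignment instead of once per trial. It assumes samples is sorted by t, that every TrialStart is paired with a following ReportCueOnset, and that the factor arrays are indexed by 0-based trial number; the function and argument names below are made up for illustration.

import numpy as np

def label_trials_vectorized(samples, factors):
    """Sketch: label each row with its trial number and per-trial factor
    values without looping over trials.

    samples : DataFrame with columns 't' and 'event', sorted by 't'.
    factors : dict mapping column name -> 1-D array with one value per trial.
    """
    t = samples['t'].values
    starts = samples.loc[samples.event == 'TrialStart', 't'].values
    stops = samples.loc[samples.event == 'ReportCueOnset', 't'].values

    # For each sample time, the index of the most recent TrialStart (-1 if none).
    trial_idx = np.searchsorted(starts, t, side='right') - 1
    # A row belongs to that trial only if its time is <= the trial's stop time.
    in_trial = (trial_idx >= 0) & (t <= stops[np.clip(trial_idx, 0, len(stops) - 1)])

    samples['trial'] = np.where(in_trial, trial_idx + 1, np.nan)

    # Broadcast each factor's per-trial value onto the rows of that trial.
    safe_idx = np.clip(trial_idx, 0, len(starts) - 1)
    for name, values in factors.items():
        col = np.asarray(values, dtype=float)[safe_idx]
        col[~in_trial] = np.nan
        samples[name] = col
    return samples

Here factors would be something like {'cpos': cpos, 'tpos': tpos, 'torient': torient, 'soa': soa, 'resp': resp, 'cong': cong}. Each column is written with one assignment instead of one per trial, which avoids the repeated _setitem_with_indexer calls that dominate the profile.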

Related

Group by, MultiIndex, time series zscore calculation slow (17 seconds), need a clue to speed it up

I am facing a performance issue with a pandas rolling (expanding) zscore calculation over 10 years of history records. It is too slow:
for a single recent-day zscore, it needs 17 seconds;
to calculate the whole history, it needs around 30 minutes (I have already resampled the history to weekly level to downsize the total number of records).
If you have any advice to speed up my lastz function, please feel free to share your idea.
Here are the details.
1. Data set: a 10-year stock record which has been resampled to balance size and accuracy. Total size is (207376, 8), covering about 500 indices over the last 10 years. Here is a sample:

                         Close      PB1      PB2        PE1        PE2  TurnoverValue   TurnoverVol       ROE
ticker tradeDate
000001 2007-01-07  2678.526489  3.38135  2.87570  34.423700  61.361549   7.703712e+10  1.131558e+10  0.098227
       2007-01-14  2755.759814  3.45878  3.09090  35.209019  66.407800   7.897185e+10  1.116473e+10  0.098236
       2007-01-21  2796.761572  3.49394  3.31458  35.561800  70.449658   8.416415e+10  1.129387e+10  0.098250
I want to analyze how the zscore changes over history and forecast it into the future, so the lastz function is defined as below. This is the function that needs speeding up:
import numpy as np
import pandas as pd
from scipy import stats

ts_start = pd.to_datetime("20180831")

#numba.jit
def lastz(x):
    if x.index.max()[1] < ts_start:
        return np.nan
    else:
        freedom = 1  # it is a sample, so the sample std degrees of freedom should be 1, not 0
        nlimit_interpolate = int(len(x)/100)  # 1% fill allowed
        #print(nlimit_interpolate, len(x))
        x = x.interpolate(limit=nlimit_interpolate + 1)  # plus 1 in case of 0 or a negative value
        x = x.loc[x.notnull()]
        Arry = x.values
        zscore = stats.zmap(Arry[-1], Arry, ddof=freedom)
        return zscore
weekly = weekly.sort_index()
%prun -s cumtime result = weekly.groupby(level="ticker").agg(lastz)
Here is the prun results for single calling:
13447048 function calls (13340521 primitive calls) in 17.183 seconds
Ordered by: cumulative time
 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000   17.183   17.183 {built-in method builtins.exec}
      1    0.000    0.000   17.183   17.183 <string>:1(<module>)
      1    0.000    0.000   17.176   17.176 groupby.py:4652(aggregate)
      1    0.000    0.000   17.176   17.176 groupby.py:4086(aggregate)
      1    0.000    0.000   17.176   17.176 base.py:562(_aggregate_multiple_funcs)
   16/8    0.000    0.000   17.171    2.146 groupby.py:3471(aggregate)
      8    0.000    0.000   17.171    2.146 groupby.py:3513(_aggregate_multiple_funcs)
      8    0.000    0.000   17.147    2.143 groupby.py:1060(_python_agg_general)
      8    0.000    0.000   17.145    2.143 groupby.py:2668(agg_series)
      8    0.172    0.022   17.145    2.143 groupby.py:2693(_aggregate_series_pure_python)
   4400    0.066    0.000   15.762    0.004 groupby.py:1062(<lambda>)
   4400    0.162    0.000   14.255    0.003 <ipython-input-10-fdb784c8abd8>:15(lastz)
   4400    0.035    0.000    8.982    0.002 base.py:807(max)
   4400    0.070    0.000    7.955    0.002 multi.py:807(values)
   4400    0.017    0.000    6.406    0.001 datetimes.py:976(astype)
   4400    0.007    0.000    6.316    0.001 datetimelike.py:1130(astype)
   4400    0.030    0.000    6.301    0.001 datetimelike.py:368(_box_values_as_index)
   4400    0.009    0.000    5.613    0.001 datetimelike.py:362(_box_values)
   4400    0.860    0.000    5.602    0.001 {pandas._libs.lib.map_infer}
1659008    4.278    0.000    4.741    0.000 datetimes.py:606(<lambda>)
   4328    0.096    0.000    1.774    0.000 generic.py:5980(interpolate)
   4336    0.015    0.000    1.696    0.000 indexing.py:1463(__getitem__)
   4328    0.028    0.000    1.675    0.000 indexing.py:1854(_getitem_axis)
I was wondering whether the datetime comparison is being called too frequently, and whether there is a better method to skip results that have already been calculated. I calculate the result weekly, so last week's data is already on hand and does not need to be calculated again. The index.max()[1] is used to check whether the dataset is later than a certain day: if it is newer, calculate; otherwise just return NaN.
If I use rolling or expanding mode, half an hour to 2 hours are needed to get the result.
I appreciate any idea or clue to speed up the function.
timeit results of different index-selection methods on the pandas MultiIndex:
I changed the index selection method, which saves 6 seconds on each single calculation. However, the total running time is still too long to be acceptable; I need a clue to optimize it.
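As an aside, most of those 17 seconds go into boxing the MultiIndex dates inside x.index.max() and into the 4400 Python-level lastz calls themselves. If the goal is only the zscore of each ticker's most recent row against that ticker's history, a vectorized sketch (interpolation omitted, and assuming weekly is sorted by date within each ticker, as above) could look like the following; the function name is made up for illustration:

import numpy as np
import pandas as pd

def last_zscore_by_ticker(weekly, ts_start):
    """Sketch: zscore of each ticker's most recent row against its history,
    computed with vectorized groupby reductions instead of .agg(lastz)."""
    # Skip tickers whose latest trade date is older than ts_start
    # (lastz returns NaN for these; here they are simply dropped).
    dates = pd.Series(weekly.index.get_level_values("tradeDate"), index=weekly.index)
    recent = dates.groupby(level="ticker").transform("max") >= ts_start

    g = weekly[recent].groupby(level="ticker")
    mean = g.mean()
    std = g.std(ddof=1)  # sample std, matching ddof=1 in lastz
    last = g.tail(1).reset_index(level="tradeDate", drop=True)  # most recent row per ticker
    return (last - mean) / std

This trades the per-group Python calls for a handful of vectorized groupby reductions; whether the interpolation step also needs to be reproduced depends on how much missing data there is.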

Optimize function slicing numpy arrays

I have the following function, which takes a numpy array of floats and an integer as its arguments. Each row in the array 'counts' is the result of some experiment, and I want to randomly draw a list of the experiments and add them up, then repeat this process to create lots of sample groups.
import numpy as np

def my_function(counts, nSamples):
    ''' Create multiple randomly drawn (with replacement)
    samples from the raw data '''
    nSat, nRegions = counts.shape
    sampleData = np.zeros((nSamples, nRegions))
    for i in range(nSamples):
        rc = np.random.randint(0, nSat, size=nSat)
        sampleData[i] = counts[rc].sum(axis=0)
    return sampleData
This function seems quite slow, typically counts has around 100,000 rows (and 4 columns) and nSamples is around 2000. I have tried using numba and implicit for loops to try and speed up this code with no success.
What are some other methods to try and increase the speed?
I have run cProfile on the function and got the following output.
8005 function calls in 60.208 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 60.208 60.208 <string>:1(<module>)
2000 0.010 0.000 13.306 0.007 _methods.py:31(_sum)
1 40.950 40.950 60.208 60.208 optimize_bootstrap.py:25(bootstrap)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2000 5.938 0.003 5.938 0.003 {method 'randint' of 'mtrand.RandomState' objects}
2000 13.296 0.007 13.296 0.007 {method 'reduce' of 'numpy.ufunc' objects}
2000 0.015 0.000 13.321 0.007 {method 'sum' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {numpy.core.multiarray.zeros}
1 0.000 0.000 0.000 0.000 {range}
Are you sure that
rc = np.random.randint(0,nSat,size=nSat)
is what you want, instead of size=someconstant? Otherwise you're summing over all the rows with many repeats.
edit
does it help to replace the slicing altogether with a matrix product:
rcvec = np.zeros(nSat, np.int)
for j in rc:              # count how many times each row was drawn
    rcvec[j] += 1
sampleData[i] = rcvec.dot(counts)
(maybe there is a function in numpy that can give you rcvec faster)
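That faster function appears to be np.bincount; a sketch of the matrix-product variant using it, keeping the question's counts/nSat/nSamples names (the function name here is made up):

import numpy as np

def my_function_bincount(counts, nSamples):
    '''Sketch: the same bootstrap, with each sample computed as a matrix
    product of per-row draw counts with the data.'''
    nSat, nRegions = counts.shape
    sampleData = np.zeros((nSamples, nRegions))
    for i in range(nSamples):
        rc = np.random.randint(0, nSat, size=nSat)
        # How many times each row was drawn; minlength pads rows never drawn.
        rcvec = np.bincount(rc, minlength=nSat)
        sampleData[i] = rcvec.dot(counts)
    return sampleData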
Simply generate all the indices in one go with a 2D size for np.random.randint, use those to index into the counts array, and then sum along the first axis, just like you were doing with the loopy version.
Thus, one vectorized (and as such faster) way would be like so:
RC = np.random.randint(0,nSat,size=(nSat, nSamples))
sampleData_out = counts[RC].sum(axis=0)
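One caveat on top of this answer: counts[RC] materializes an (nSat, nSamples, nRegions) intermediate, which at the sizes quoted in the question (~100,000 x 2,000 x 4 float64 values) runs to several gigabytes, so processing the samples in chunks may be more practical. A sketch:

import numpy as np

def bootstrap_chunked(counts, nSamples, chunk=100):
    '''Sketch: the same vectorized indexing, done chunk-by-chunk over samples
    to bound the size of the temporary counts[RC] array.'''
    nSat, nRegions = counts.shape
    out = np.empty((nSamples, nRegions))
    for start in range(0, nSamples, chunk):
        stop = min(start + chunk, nSamples)
        RC = np.random.randint(0, nSat, size=(stop - start, nSat))
        out[start:stop] = counts[RC].sum(axis=1)
    return out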

Increase the speed of my code

I have created the code below. It takes a series of values and generates 10 numbers between x and r with an average value of 8000.
In order to meet the specification to cover the range as well as possible, I also calculated the standard deviation, which is a good measure of spread. So whenever a sample set meets the criterion of a mean of 8000, I compare it to previous matches and always keep the samples that have the highest std dev (mean always = 8000).
import random as rd
import statistics as st

def node_timing(average_block_response_computational_time, min_block_response_computational_time, max_block_response_computational_time):
    sample_count = 10
    num_of_trials = 1
    # print average_block_response_computational_time
    # print min_block_response_computational_time
    # print max_block_response_computational_time
    target_sum = sample_count * average_block_response_computational_time
    samples_list = []
    curr_stdev_max = 0
    for trials in range(num_of_trials):
        samples = [0] * sample_count
        while sum(samples) != target_sum:
            samples = [rd.randint(min_block_response_computational_time, max_block_response_computational_time) for trial in range(sample_count)]
        # print ("Mean: ", st.mean(samples), "Std Dev: ", st.stdev(samples), )
        # print (samples, "\n")
        if st.stdev(samples) > curr_stdev_max:
            curr_stdev_max = st.stdev(samples)
            samples_best = samples[:]
    return samples_best[0]
I take the first value in the list and use it as a timing value. However, this code is REALLY slow, and I need to call it several thousand times during the simulation, so I need to improve its efficiency somehow.
Anyone got any suggestions on how to?
To see where we'd get the best speed improvements, I started by profiling your code.
import cProfile

pr = cProfile.Profile()
pr.enable()
for i in range(100):
    print(node_timing(8000, 7000, 9000))
pr.disable()
pr.print_stats(sort='time')
The top of the results show where your code is spending most of its time:
23561178 function calls (23561176 primitive calls) in 10.612 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
4502300 3.694 0.000 7.258 0.000 random.py:172(randrange)
4502300 2.579 0.000 3.563 0.000 random.py:222(_randbelow)
4502300 1.533 0.000 8.791 0.000 random.py:216(randint)
450230 1.175 0.000 9.966 0.000 counter.py:19(<listcomp>)
4608421 0.690 0.000 0.690 0.000 {method 'getrandbits' of '_random.Random' objects}
100 0.453 0.005 10.596 0.106 counter.py:5(node_timing)
4502300 0.294 0.000 0.294 0.000 {method 'bit_length' of 'int' objects}
450930 0.141 0.000 0.150 0.000 {built-in method builtins.sum}
100 0.016 0.000 0.016 0.000 {built-in method builtins.print}
600 0.007 0.000 0.025 0.000 statistics.py:105(_sum)
2200 0.005 0.000 0.006 0.000 fractions.py:84(__new__)
...
From this output, we can see that we're spending ~7.5 seconds (out of 10.6 seconds) generating random numbers. Therefore, the only way to make this noticeably faster is to generate fewer random numbers or generate them faster. You're not using a cryptographic random number generator so I don't have a way to make generating numbers faster. However, we can fudge the algorithm a bit and drastically reduce the number of values we need to generate.
Instead of only accepting samples with a mean of exactly 8000, what if we accepted samples with a mean of 8000 +- 0.1% (then we're taking samples with a mean of 7992 to 8008)? By being a tiny bit inexact, we can drastically speed up the algorithm. I replaced the while condition with:
while abs(sum(samples) - target_sum) > epsilon
Where epsilon = target_sum * 0.001. Then I ran the script again and got much better profiler numbers.
232439 function calls (232437 primitive calls) in 0.163 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
100 0.032 0.000 0.032 0.000 {built-in method builtins.print}
31550 0.026 0.000 0.053 0.000 random.py:172(randrange)
31550 0.019 0.000 0.027 0.000 random.py:222(_randbelow)
31550 0.011 0.000 0.064 0.000 random.py:216(randint)
4696 0.010 0.000 0.013 0.000 fractions.py:84(__new__)
3155 0.008 0.000 0.073 0.000 counter.py:19(<listcomp>)
600 0.008 0.000 0.039 0.000 statistics.py:105(_sum)
100 0.006 0.000 0.131 0.001 counter.py:4(node_timing)
32293 0.005 0.000 0.005 0.000 {method 'getrandbits' of '_random.Random' objects}
1848 0.004 0.000 0.009 0.000 fractions.py:401(_add)
Allowing the mean to be up to 0.1% off of the target dropped the number of calls to randint by 100x. Naturally, the code also runs 100x faster (and now spends most of its time printing to console).
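In context, the relaxed acceptance loop described above would look roughly like this (only the while condition and the epsilon line change; the rest of node_timing is untouched):

epsilon = target_sum * 0.001  # accept means within 0.1% of the target
samples = [0] * sample_count
while abs(sum(samples) - target_sum) > epsilon:
    samples = [rd.randint(min_block_response_computational_time,
                          max_block_response_computational_time)
               for trial in range(sample_count)]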

Assign sampled multinomial values uniformly at random

I am using np.random.multinomial to sample a multinomial distribution M times (given probabilities [X_0 X_1 .. X_n] it returns counts [C_0 C_1 ... C_n] sampled from the specified multinomial, where \sum_i C_i = M). Given these sampled values (the C_i's), I want to assign them uniformly at random to some objects I have.
Currently what I'm doing is:
draws = np.random.multinomial(M, probs, size=1)
draws = draws[0]
draws_list = []
for idx, num in enumerate(draws):
    draws_list += [idx] * num
random.shuffle(draws_list)
Then draws_list is a randomly shuffled list of the sampled values.
The problem is that populating draws_list (the for loop) is very slow. Is there a better/faster way to do this?
Try this code. The strategy is to allocate the memory first, then fill in the data.
draws_list1 = np.empty(M, dtype=np.int)
acc = 0
for idx, num in enumerate(draws):
    draws_list1[acc:acc+num].fill(idx)
    acc += num
Here's the full code for profiling.
import numpy as np
import cProfile

M = 10000000
draws = np.random.multinomial(M, [1/6.]*6, size=1)
draws = draws[0]
draws_list1 = np.empty(M, dtype=np.int)

def impl0():
    draws_list0 = []
    for idx, num in enumerate(draws):
        draws_list0 += [idx]*num
    return draws_list0

def impl1():
    acc = 0
    for idx, num in enumerate(draws):
        draws_list1[acc:acc+num].fill(idx)
        acc += num
    return draws_list1

cProfile.run("impl0()")
cProfile.run("impl1()")
Here's the result of cProfile. (If the np.empty call is placed inside impl1 instead, the elapsed time is 0.020 seconds.)
3 function calls in 0.095 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.020 0.020 0.095 0.095 <string>:1(<module>)
1 0.076 0.076 0.076 0.076 prof.py:11(impl0)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
9 function calls in 0.017 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.017 0.017 <string>:1(<module>)
1 0.000 0.000 0.017 0.017 prof.py:17(impl1)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
6 0.017 0.003 0.017 0.003 {method 'fill' of 'numpy.ndarray' objects}
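A further option, not from the original answer but using only plain numpy calls, is to let np.repeat build the index array directly and then shuffle it in place; a sketch assuming the same draws array as above:

import numpy as np

# Repeat each index idx exactly draws[idx] times, then shuffle for random order.
draws_list2 = np.repeat(np.arange(len(draws)), draws)
np.random.shuffle(draws_list2)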

Need for speed: Slow nested groupbys and applys in Pandas

I am performing a complex transformation on a DataFrame. I thought it would be quick for Pandas, but the only way I've managed to do it is with some nested groupbys and applys, using lambda functions, and it is slow. It seems like the sort of thing where there should be built-in, faster methods. At n_rows=1000 it's 2 seconds, but I'll be doing 10^7 rows, so this is far too slow. It's difficult to explain what we're doing, so here's the code and profile, then I'll explain:
n_rows = 1000
d = pd.DataFrame(randint(1,10,(n_rows,8))) #Raw data
dgs = array([3,4,1,8,9,2,3,7,10,8]) #Values we will look up, referenced by index
grps = pd.cut(randint(1,5,n_rows),arange(1,5)) #Grouping
f = lambda x: dgs[x.index].mean() #Works on a grouped Series
g = lambda x: x.groupby(x).apply(f) #Works on a Series
h = lambda x: x.apply(g,axis=1).mean(axis=0) #Works on a grouped DataFrame
q = d.groupby(grps).apply(h) #Slow
824984 function calls (816675 primitive calls) in 1.850 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
221770 0.105 0.000 0.105 0.000 {isinstance}
7329 0.104 0.000 0.217 0.000 index.py:86(__new__)
8309 0.089 0.000 0.423 0.000 series.py:430(__new__)
5375 0.081 0.000 0.081 0.000 {method 'reduce' of 'numpy.ufunc' objects}
34225 0.068 0.000 0.133 0.000 {method 'view' of 'numpy.ndarray' objects}
36780/36779 0.067 0.000 0.067 0.000 {numpy.core.multiarray.array}
5349 0.065 0.000 0.567 0.000 series.py:709(_get_values)
985/1 0.063 0.000 1.847 1.847 groupby.py:608(apply)
5349 0.056 0.000 0.198 0.000 _methods.py:42(_mean)
5358 0.050 0.000 0.232 0.000 index.py:332(__getitem__)
8309 0.049 0.000 0.228 0.000 series.py:3299(_sanitize_array)
9296 0.047 0.000 0.116 0.000 index.py:1341(__new__)
984 0.039 0.000 0.092 0.000 algorithms.py:105(factorize)
Group the DataFrame rows by the groupings. For each grouping, for each row, group by those values that are the same (i.e. all have the value 3 versus all have value 4). For each index in a value grouping, look up the corresponding index in dgs, and average. Then average for the row groupings.
::exhale::
Any suggestions on how to rearrange this for speed would be appreciated.
You can do the apply and groupby with one multilevel groupby; here is the code:
import pandas as pd
import numpy as np
from numpy import array, arange
from numpy.random import randint, seed
seed(42)

n_rows = 1000
d = pd.DataFrame(randint(1,10,(n_rows,8))) #Raw data
dgs = array([3,4,1,8,9,2,3,7,10,8]) #Values we will look up, referenced by index
grps = pd.cut(randint(1,5,n_rows),arange(1,5)) #Grouping
f = lambda x: dgs[x.index].mean() #Works on a grouped Series
g = lambda x: x.groupby(x).apply(f) #Works on a Series
h = lambda x: x.apply(g,axis=1).mean(axis=0) #Works on a grouped DataFrame
print d.groupby(grps).apply(h) #Slow

### my code starts from here ###
def group_process(df2):
    s = df2.stack()
    v = np.repeat(dgs[None, :df2.shape[1]], df2.shape[0], axis=0).ravel()
    return pd.Series(v).groupby([s.index.get_level_values(0), s.values]).mean().mean(level=1)

print d.groupby(grps).apply(group_process)
output:
1 2 3 4 5 6 7 \
(1, 2] 4.621575 4.625887 4.775235 4.954321 4.566441 4.568111 4.835664
(2, 3] 4.446347 4.138528 4.862613 4.800538 4.582721 4.595890 4.794183
(3, 4] 4.776144 4.510119 4.391729 4.392262 4.930556 4.695776 4.630068
8 9
(1, 2] 4.246085 4.520384
(2, 3] 5.237360 4.418934
(3, 4] 4.829167 4.681548
[3 rows x 9 columns]
1 2 3 4 5 6 7 \
(1, 2] 4.621575 4.625887 4.775235 4.954321 4.566441 4.568111 4.835664
(2, 3] 4.446347 4.138528 4.862613 4.800538 4.582721 4.595890 4.794183
(3, 4] 4.776144 4.510119 4.391729 4.392262 4.930556 4.695776 4.630068
8 9
(1, 2] 4.246085 4.520384
(2, 3] 5.237360 4.418934
(3, 4] 4.829167 4.681548
[3 rows x 9 columns]
It's about 70x faster, but I don't know if it can work with 10**7 rows.
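A quick sanity check before scaling up, using the small n_rows=1000 frame defined above, is to confirm that the two approaches produce the same numbers (this check is illustrative, not part of the original answer):

import numpy as np

slow = d.groupby(grps).apply(h)
fast = d.groupby(grps).apply(group_process)
print np.allclose(slow.values, fast.values)  # expect True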
