I have a dataframe of 6M+ observations, where 20 of the columns are weights that will be applied to a single score column. I.e., Wgt1 * Wgt2 * Wgt3 * ... * Score. In addition, not every weight applies to every observation, so I have created 20 columns that represent a weight mask. I.e., (Wgt1*Msk1) * (Wgt2*Msk2) * (Wgt3*Msk3) ... Score. When the mask is 0, the weight is not applicable; when the mask is 1, it is applicable.
For each row in the dataframe, I want to:
1. Check two qualifying conditions that indicate the row should be processed
2. Find the product of the weights, subject to the presence of the corresponding mask (ttl_wgt)
3. Multiply this product by the score (prob) to create a final weighted score
To do this, I have created a user-defined function:
import functools
import operator
import time

import numpy as np  # needed for np.isnan / np.where / np.asarray below

def mymult(a):
    ttl_wgt = float('NaN')  # initialize to NaN
    if ~np.isnan(a['ID']):  # condition 1: only process if an ID is present
        if a['prob'] > -1.0:  # condition 2: only process if our unweighted score is NOT -1.0
            b = np.where(a[msks] == 1)[0]  # indices of the masks that are 1
            ttl_wgt = functools.reduce(operator.mul, a[np.asarray(wgt_nms)[b]], 1)
    return ttl_wgt
I ran out of memory during development, so I decided to chunk it up into 500000 rows at a time. I use a lambda function to apply to the chunk:
msks = ['Msk1','Msk2','Msk3','Msk4',...,'Msk20']
wgt_nms = ['Wgt1','Wgt2','Wgt3','Wgt4',...,'Wgt20']

print('Determining final weights...')
chunksize = 500000  # we'll operate on this many rows at a time
start_time = time.time()
ttl_wgts = []  # initialize list to hold weight products
for i in range(0, len(df), chunksize):
    ttl_wgts.extend(df[i:(i+chunksize)].apply(lambda x: mymult(x), axis=1))
print("--- %s seconds ---" % (time.time() - start_time))  # expect between 30 and 40 minutes
print('Done!')
Then I assign the ttl_wgts list as a new column in the dataframe, and take the final product of weight * initial score.
#Initialize the fields
#Might not be necessary or even useful
df['ttl_wgt'] = float('NaN')
df['wgt_prob'] = float('NaN')
df['ttl_wgt'] = ttl_wgts
df['wgt_prob'] = df['ttl_wgt'] * df['prob']
I checked out a prior post on multiplying elements in a list. It was great food for thought, but I wasn't able to turn it into anything more efficient for my 6M+ observations. Are there other approaches I should be considering?
Adding an example df, as suggested
A sample of the dataframe might look something like this, with only 3 masks/weights:
df = pd.DataFrame({'id': [999999999,136550,80010170,80010177,90002408,90002664,16207501,62992,np.nan,80010152],
'prob': [-1,0.180274382,0.448361456,0.000945058,0.005060279,0.009893078,0.169686288,0.109541453,0.117907763,0.266242921],
'Msk1': [0,1,1,1,0,0,1,0,0,0],
'Msk2': [0,0,1,0,0,0,0,1,0,0],
'Msk3': [1,0,0,0,1,1,0,0,1,1],
'Wgt1': [np.nan,0.919921875,1.08984375,1.049804688,np.nan,np.nan,np.nan,0.91015625,np.nan,0.810058594],
'Wgt2': [np.nan,1.129882813,1.120117188,0.970214844,np.nan,np.nan,np.nan,1.0703125,np.nan,0.859863281],
'Wgt3': [np.nan,1.209960938,1.23046875,1,np.nan,np.nan,np.nan,1.150390625,np.nan,0.649902344]
})
In the first observation, the prob field is -1, so the row would not be processed. In the second observation, Msk1 is turned on while Msk2 and Msk3 are turned off. Thus the final weight would be the value of Wgt1 = 0.919922. In the third row, Msk1 and Msk2 are on, while Msk3 is off. Therefore the final weight would be Wgt1*Wgt2 = 1.089844*1.120117 = 1.220752.
IIUC:
You want to fill in your masked weights with 1. Then you can multiply them all together with no impact from the masked ones. That's the trick. You'll have to apply it as needed.
create msk
msk = df.filter(like='Msk')
print(msk)
Msk1 Msk2 Msk3
0 0 0 1
1 1 0 0
2 1 1 0
3 1 0 0
4 0 0 1
5 0 0 1
6 1 0 0
7 0 1 0
8 0 0 1
9 0 0 1
create wgt
wgt = df.filter(like='Wgt')
print(wgt)
Wgt1 Wgt2 Wgt3
0 NaN NaN NaN
1 0.919922 1.129883 1.209961
2 1.089844 1.120117 1.230469
3 1.049805 0.970215 1.000000
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 0.910156 1.070312 1.150391
8 NaN NaN NaN
9 0.810059 0.859863 0.649902
create new_weight
new_wgt = np.where(msk, wgt, 1)
print(new_wgt)
[[ 1. 1. nan]
[ 0.91992188 1. 1. ]
[ 1.08984375 1.12011719 1. ]
[ 1.04980469 1. 1. ]
[ 1. 1. nan]
[ 1. 1. nan]
[ nan 1. 1. ]
[ 1. 1.0703125 1. ]
[ 1. 1. nan]
[ 1. 1. 0.64990234]]
final prod_wgt
prod_wgt = pd.Series(new_wgt.prod(1), wgt.index)
print(prod_wgt)
0 NaN
1 0.919922
2 1.220753
3 1.049805
4 NaN
5 NaN
6 NaN
7 1.070312
8 NaN
9 0.649902
dtype: float64
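To tie this back to the original question's two qualifying conditions and the final weighted score, a minimal vectorized sketch might look like the following (assuming the column names 'id' and 'prob' from the sample frame; adjust to your real columns):
import numpy as np
import pandas as pd
msk = df.filter(like='Msk')
wgt = df.filter(like='Wgt')
valid = df['id'].notna() & (df['prob'] > -1.0)            # the two qualifying conditions
prod_wgt = pd.Series(np.where(msk, wgt, 1).prod(axis=1), index=df.index)
df['ttl_wgt'] = prod_wgt.where(valid)                     # NaN where a row is skipped
df['wgt_prob'] = df['ttl_wgt'] * df['prob']
This avoids the row-wise apply and the chunking loop entirely, because every step operates on whole columns.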
Related
I have a data frame that looks like this:
data_dict = {'factor_1' : np.random.randint(1, 5, 10), 'factor_2' : np.random.randint(1, 5, 10), 'multi' : np.random.rand(10), 'output' : np.NaN}
df = pd.DataFrame(data_dict)
I'm getting stuck implementing this comparison:
If factor_1 and factor_2 values match, then output = 2 * multi (Here 2 is kind of a base value). Continue scanning the next rows.
If factor_1 and factor_2 values don't match then:
output = -2. Scan the next row(s).
If the factor values still don't match until row R, then assign values for output as $-2^2, -2^3, \ldots, -2^R$ respectively.
When the factor values match at row R+1, then assign the value for output as $2^{R+1} \cdot multi$.
Repeat the process
The end result will look like the output shown in the answers below.
This solution does not use recursion:
# sample data
np.random.seed(1)
data_dict = {'factor_1' : np.random.randint(1, 5, 10), 'factor_2' : np.random.randint(1, 5, 10), 'multi' : np.random.rand(10), 'output' : np.NaN}
df = pd.DataFrame(data_dict)
# create a mask
mask = (df['factor_1'] != df['factor_2'])
# get the cumsum from the mask
df['R'] = mask.cumsum() - mask.cumsum().where(~mask).ffill().fillna(0)
# use np.where to create the output
df['output'] = np.where(df['R'] == 0, df['multi']*2, -2**df['R'])
factor_1 factor_2 multi output R
0 2 1 0.419195 -2.000000 1.0
1 4 2 0.685220 -4.000000 2.0
2 1 1 0.204452 0.408904 0.0
3 1 4 0.878117 -2.000000 1.0
4 4 2 0.027388 -4.000000 2.0
5 2 1 0.670468 -8.000000 3.0
6 4 3 0.417305 -16.000000 4.0
7 2 2 0.558690 1.117380 0.0
8 4 3 0.140387 -2.000000 1.0
9 1 1 0.198101 0.396203 0.0
The solution I present is perhaps a little harder to read, but I think it works as you wanted. It combines
numpy.where() in order to make a column based on a condition,
pandas.DataFrame.shift() and pandas.DataFrame.cumsum() to label different groups with consecutive similar values, and
pandas.DataFrame.rank() in order to construct a vector of powers used on previously made df['output'] column.
The code follows.
df['output'] = np.where(df.factor_1 == df.factor_2, -2 * df.multi, 2)
group = ['output', (df.output != df.output.shift()).cumsum()]
df['output'] = (-1) * df.output.pow(df.groupby(group).output.rank('first'))
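For readers unfamiliar with the shift/cumsum step used above, here is a minimal sketch (with a made-up series s) showing how it labels runs of consecutive equal values:
import pandas as pd
s = pd.Series([1, 1, 2, 2, 2, 1])
run_id = (s != s.shift()).cumsum()
print(run_id.tolist())   # [1, 1, 2, 2, 2, 3] - one label per run of equal values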
flag = False
cols = ('factor_1', 'factor_2', 'multi')
# 'output' starts as a scalar NaN in data_dict, so pre-allocate an array to write into
data_dict['output'] = np.full(len(data_dict['multi']), np.nan)
z = zip(*[data_dict[col] for col in cols])
for i, (f1, f2, multi) in enumerate(z):
    if f1 == f2:
        output = 2 * multi
        flag = False
    else:
        if flag:
            output *= 2
        else:
            output = -2
        flag = True
    data_dict['output'][i] = output
df['output'] = data_dict['output']
The tricky part is the flag variable, which tells you whether the previous row had a match or not.
I am running this code on a large csv file (1.5 million rows). Is there a way to optimise it?
df is a pandas dataframe.
I take a row and want to know what happens first in the 1000 following rows:
do I find my value + 0.0004 first, or my value - 0.0004?
result = []
for row in range(len(df) - 1000):
    start = df.at[row, 'A']   # .at replaces the removed DataFrame.get_value
    win = start + 0.0004
    lose = start - 0.0004
    for n in range(1000):
        ref = df.at[row + n, 'B']
        if ref > win:
            result.append(1)
            break
        elif ref <= lose:
            result.append(-1)
            break
        elif n == 999:
            result.append(0)
The dataframe looks like:
timestamp A B
0 20190401 00:00:00.127 1.12230 1.12236
1 20190401 00:00:00.395 1.12230 1.12237
2 20190401 00:00:00.533 1.12229 1.12234
3 20190401 00:00:00.631 1.12228 1.12233
4 20190401 00:00:01.019 1.12230 1.12234
5 20190401 00:00:01.169 1.12231 1.12236
The result is: result = [0, 0, 1, 0, 0, 1, -1, 1, …]
This works, but it takes a long time to process on such large files.
To generate values for the "first outlier", define the following function:
def firstOutlier(row, dltRow = 4, dltVal = 0.1):
    ''' Find the value for the first "outlier". Parameters:
    row    - the current row
    dltRow - number of rows to check, starting from the current
    dltVal - delta in value of "B", compared to "A" in the current row
    '''
    rowInd = row.name                        # index of the current row
    df2 = df.iloc[rowInd : rowInd + dltRow]  # "dltRow" rows from the current
    outliers = df2[abs(df2.B - row.A) >= dltVal]
    if outliers.index.size == 0:             # no outliers within the range of rows
        return 0
    return int(np.sign(outliers.iloc[0].B - row.A))
Then apply it to each row:
df.apply(firstOutlier, axis=1)
This function relies on the DataFrame having an index of consecutive integers starting from 0, so that given ind, the index of any row, we can access that row with df.iloc[ind] and a slice of n rows starting from it with df.iloc[ind : ind + n].
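If your frame does not already have such an index (for example after filtering), resetting it keeps the iloc arithmetic valid; a one-line sketch:
df = df.reset_index(drop=True)   # ensure a 0..n-1 RangeIndex before the apply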
For my test, I set the default values of the parameters to:
dltRow = 4 - look at 4 rows, starting from the current one,
dltVal = 0.1 - look for rows with the B column "distant by" 0.1 or more from A in the current row.
My test DataFrame was:
A B
0 1.00 1.00
1 0.99 1.00
2 1.00 0.80
3 1.00 1.05
4 1.00 1.20
5 1.00 1.00
6 1.00 0.80
7 1.00 1.00
8 1.00 1.00
The result (for my data and default values of parameters) was:
0 -1
1 -1
2 -1
3 1
4 1
5 -1
6 -1
7 0
8 0
dtype: int64
For your needs, change the default parameter values to 1000 and 0.0004 respectively.
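For example, the call would then look something like this (a sketch with the question's parameters):
result = df.apply(lambda row: firstOutlier(row, dltRow=1000, dltVal=0.0004), axis=1)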
The idea is to loop through A and B while maintaining a sorted list of A values. Then, for each B, find the highest A that loses and the lowest A that wins. Since it's a sorted list it's O(log(n)) to search. Only those A's that have index in the last 1000 are used for setting the result vector. After that the A's that are no longer waiting for a B are removed from this sorted list to keep it small.
import numpy as np
import bisect
import time

N = 10
M = 3
#N = int(1e6)
#M = int(1e3)
thresh = 0.4

A = np.random.rand(N)
B = np.random.rand(N)
result = np.zeros(N)
l = []          # sorted list of (A value, row index) pairs still waiting for an outcome

t_start = time.time()
for i in range(N):
    a = (A[i], i)
    bisect.insort(l, a)
    b = B[i]
    firstLoseInd = bisect.bisect_left(l, (b + thresh, -1))   # first A that loses against this B
    lastWinInd = bisect.bisect_right(l, (b - thresh, -1))    # one past the last A that wins
    for j in range(lastWinInd):
        curInd = l[j][1]
        if curInd > i - M:          # only rows within the look-ahead window M
            result[curInd] = 1
    for j in range(firstLoseInd, len(l)):
        curInd = l[j][1]
        if curInd > i - M:
            result[curInd] = -1
    del l[firstLoseInd:]            # these A's have their outcome; drop them
    del l[:lastWinInd]
t_done = time.time()

print(A)
print(B)
print(result)
print(t_done - t_start)
This is a sample output:
[ 0.22643589 0.96092354 0.30098532 0.15569044 0.88474775 0.25458535
0.78248271 0.07530432 0.3460113 0.0785128 ]
[ 0.83610433 0.33384085 0.51055061 0.54209458 0.13556121 0.61257179
0.51273686 0.54850825 0.24302884 0.68037965]
[ 1. -1. 0. 1. -1. 0. -1. 1. 0. 1.]
For N = int(1e6) and M = int(1e3) it took about 3.4 seconds on my computer.
My data looks like:
list=[44359, 16610, 8364, ..., 1, 1, 1]
For each element in the list I want to take x[i] * (x[i+1] + x[i-1]) / 2, where x[i] is an element of the list and x[i+1], x[i-1] are its adjacent elements.
For some reason I cannot seem to do this cleanly in NumPy.
Here's what I've tried:
weights = []
weights.append(1)
for i in range(len(hoff[3])-1):
    weights.append((hoff[3][i-1]+hoff[3][i+1])/2)
I append 1 to the weights list so that the lengths will match at the end. I picked 1 arbitrarily; I'm not sure how to deal with the leftmost and rightmost points either.
You can use numpy's array operations to represent your "loop". If you think of the data as below, where pL and pR are the values you choose to "pad" your data with on the left and right:
[pL, 0, 1, 2, ..., N-2, N-1, pR]
What you're trying to do is this:
[0, ..., N-1] * ([pL, 0, ..., N-2] + [1, ..., N-1, pR]) / 2
Written in code it looks something like this:
import numpy as np
data = np.random.random(10)
padded = np.concatenate(([data[0]], data, [data[-1]]))
data * (padded[:-2] + padded[2:]) / 2.
Repeating the first and last value is known as "extending" in image processing, but there are other edge handling methods you could try.
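If you prefer, np.pad expresses the same padding and lets you swap in other edge-handling modes; a sketch under the same assumptions as above:
import numpy as np
data = np.random.random(10)
padded = np.pad(data, 1, mode='edge')        # repeats the first and last value ("extending")
# other choices include mode='constant' (pad with a fixed value), 'reflect', and 'wrap'
result = data * (padded[:-2] + padded[2:]) / 2.0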
I would use pandas for this, filling in the missing left- and right-most values with 1 (but you can use any value you want):
import numpy
import pandas
numpy.random.seed(0)
data = numpy.random.randint(0, 10, size=15)
df = (
pandas.DataFrame({'hoff': data})
.assign(before=lambda df: df['hoff'].shift(1).fillna(1).astype(int))
.assign(after=lambda df: df['hoff'].shift(-1).fillna(1).astype(int))
.assign(weight=lambda df: df['hoff'] * df[['before', 'after']].mean(axis=1))
)
print(df.to_string(index=False))
And that gives me:
hoff before after weight
5 1 0 2.5
0 5 3 0.0
3 0 3 4.5
3 3 7 15.0
7 3 9 42.0
9 7 3 45.0
3 9 5 21.0
5 3 2 12.5
2 5 4 9.0
4 2 7 18.0
7 4 6 35.0
6 7 8 45.0
8 6 8 56.0
8 8 1 36.0
1 8 1 4.5
A pure numpy-based solution would look like this (again, filling with 1):
before_after = numpy.ones((data.shape[0], 2))
before_after[1:, 0] = data[:-1]
before_after[:-1, 1] = data[1:]
weights = data * before_after.mean(axis=1)
print(weights)
array([ 2.5, 0. , 4.5, 15. , 42. , 45. , 21. , 12.5, 9. ,
18. , 35. , 45. , 56. , 36. , 4.5])
I need to write code to create a new column B holding the cumulative sum of the first column A.
When the cumulative sum would go below 0, the value in B should be 0.
The cumulative sum then restarts from 0 until it would next drop below 0.
I searched for a similar answer but wasn't able to find one fitting my case. Thanks for your help.
A B
1 1
3 4
5 9
7 16
-6 10
-8 2
-10 *0*
6 6
-15 *0*
11 11
Set up a loop over A and keep a running total. If the total drops below 0, set it to 0. Then append the new total to B.
You start with A = [1, ...], total = 0, B = []:
total = 0
B = []
for i in range(len(A)):
    # process the sum
    total += A[i]
    if total < 0:
        total = 0
    B.append(total)
Here is a non-Pandas answer that iteratively loops through the values in column A and creates column B by never going below 0.
result = []
cur_res = 0
for i in df.A:
    cur_res = max(cur_res + i, 0)   # running total that never drops below 0
    result.append(cur_res)
df['B'] = result
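The same running-reset logic can also be written more compactly with itertools.accumulate (Python 3.8+ for the initial argument); a sketch using the question's A values:
from itertools import accumulate
A = [1, 3, 5, 7, -6, -8, -10, 6, -15, 11]
B = list(accumulate(A, lambda total, x: max(total + x, 0), initial=0))[1:]
print(B)   # [1, 4, 9, 16, 10, 2, 0, 6, 0, 11]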
I have a very large pandas dataset, and at some point I need to use the following function
def proc_trader(data):
    data['_seq'] = np.nan
    # mark every ending of a roundtrip with its index
    data.loc[data.cumq == 0, 'tag'] = np.arange(1, (data.cumq == 0).sum() + 1)  # .loc replaces the removed .ix
    # backfill the roundtrip index until the previous roundtrip;
    # then fill the rest with 0s (roundtrip incomplete for most recent trades)
    data['_seq'] = data['tag'].fillna(method='bfill').fillna(0)
    return data['_seq']
# btw, why on earth does this function return a dataframe instead of the series `data['_seq']`?
and I use apply
reshaped['_spell']=reshaped.groupby(['trader','stock'])[['cumq']].apply(proc_trader)
Obviously, I cannot share the data here, but do you see a bottleneck in my code? Could it be the arange thing? There are many name-productid combinations in the data.
Minimal Working Example:
import pandas as pd
import numpy as np
reshaped= pd.DataFrame({'trader' : ['a','a','a','a','a','a','a'],'stock' : ['a','a','a','a','a','a','b'], 'day' :[0,1,2,4,5,10,1],'delta':[10,-10,15,-10,-5,5,0] ,'out': [1,1,2,2,2,0,1]})
reshaped.sort_values(by=['trader', 'stock','day'], inplace=True)
reshaped['cumq']=reshaped.groupby(['trader', 'stock']).delta.transform('cumsum')
reshaped['_spell']=reshaped.groupby(['trader','stock'])[['cumq']].apply(proc_trader).reset_index()['_seq']
Nothing really fancy here, just tweaked in a couple of places. There is really no need to put this in a function, so I didn't. On this tiny sample data, it's about twice as fast as the original.
reshaped.sort_values(by=['trader', 'stock','day'], inplace=True)
reshaped['cumq']=reshaped.groupby(['trader', 'stock']).delta.cumsum()
reshaped.loc[ reshaped.cumq == 0, '_spell' ] = 1
reshaped['_spell'] = reshaped.groupby(['trader','stock'])['_spell'].cumsum()
reshaped['_spell'] = reshaped.groupby(['trader','stock'])['_spell'].bfill().fillna(0)
Result:
day delta out stock trader cumq _spell
0 0 10 1 a a 10 1.0
1 1 -10 1 a a 0 1.0
2 2 15 2 a a 15 2.0
3 4 -10 2 a a 5 2.0
4 5 -5 2 a a 0 2.0
5 10 5 0 a a 5 0.0
6 1 0 1 b a 0 1.0