I am setting up a ranking system in which a field called site_fees accounts for 10% of the total score. A site fee of 0 gets all 10 points. I want to calculate how many points the non-zero fees should get, but I am struggling to do so.
My initial approach was to split the dataframe into 2 dataframes (dfb where site_fees are 0 and dfa where they are > 0) and calculate the average for dfa, assign the rating for dfb as 10, then union the two.
The code is as follows:
dfSitesa = dfSites[dfSites['site_fees'].notnull()]
dfSitesb = dfSites[dfSites['site_fees'].isnull()]
dfSitesa['rating'] = FeeWeight * \
dfSitesa['site_fees'].min()/dfSitesa['site_fees']
dfSitesb['rating'] = FeeWeight
dfSites = pd.concat([dfSitesa,dfSitesb])
This produces an output, however the results of dfa are not correct because the minimum of dfa is 5000 instead of 0, so the rating of a site with $5000 in fees is 10 (the maximum, not correct). What am I doing wrong?
The minimum non-zero site_fee is 5000 and the maximum is 15000. Based on this, I would expect a general ranking system like:
15000 | 0
10000 | 3.3
5000 | 6.6
0 | 10
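Here is one way to do it: scale each fee linearly between the minimum and the maximum fee, so that the lowest fee gets FeeWeight points and the highest fee gets 0: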
import pandas as pd
dfSites = pd.DataFrame({'site_fees':[0,1,2,3,5]})
FeeWeight = 10
# .copy() avoids SettingWithCopyWarning when the rating column is added
dfSitesa = dfSites[dfSites['site_fees'].notnull()].copy()
dfSitesb = dfSites[dfSites['site_fees'].isnull()].copy()
dfSitesb['rating'] = FeeWeight
# range of the fees; the minimum here is 0, so zero-fee sites get the full weight
factor = (dfSitesa['site_fees'].max() - dfSitesa['site_fees'].min())
dfSitesa['rating'] = FeeWeight * ( 1 - ( (dfSitesa['site_fees'] - dfSitesa['site_fees'].min()) / factor) )
dfSites = pd.concat([dfSitesa,dfSitesb])
In [1] : print(dfSites)
Out[1] :
site_fees rating
0 0 10.0
1 1 8.0
2 2 6.0
3 3 4.0
4 5 0.0
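As a rough sketch, the same min-max scaling can also be written without splitting the frame, here applied to the fee values from the question. Treating a missing fee as 0 via fillna(0) is an assumption on my part, not something the question states.

```python
import pandas as pd

FeeWeight = 10
dfSites = pd.DataFrame({'site_fees': [0, 5000, 10000, 15000]})

# Assumption: a missing fee counts as a fee of 0
fees = dfSites['site_fees'].fillna(0)

# Linear rescale: lowest fee -> FeeWeight, highest fee -> 0
# (if every fee is identical, span is 0 and the ratings come out as NaN)
span = fees.max() - fees.min()
dfSites['rating'] = FeeWeight * (1 - (fees - fees.min()) / span)

print(dfSites.round(2))
```

Because the zero-fee rows stay in the frame, the minimum is 0 rather than 5000, which is what gives the 5000 site roughly 6.67 points instead of 10.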
I have a parking lot with cars of different models (nr), and the cars are packed so closely that getting one out may require moving some others. It is a little like a 15-puzzle, except that I can take one or more cars out of the parking lot. Ordered_car_List contains the cars that will be picked up today, and they need to be taken out of the parking lot while moving as few non-ordered cars as possible. There are more columns in this DataFrame, but this is the part I can't figure out.
I have a program that works well for small sets of data, but it seems that this is not the way of the PANDAS :-)
I have this:
cars = pd.DataFrame({'x': [1,1,1,1,1,2,2,2,2],
'y': [1,2,3,4,5,1,2,3,4],
'order_number':[6,6,7,6,7,9,9,10,12]})
cars['order_number_no_dublicates_down'] = None
Ordered_car_List = [6,9,9,10,28]
i=0
while i < len(cars):
    temp_val = cars.at[i, 'order_number']
    if temp_val in Ordered_car_List:
        cars.at[i, 'order_number_no_dublicates_down'] = temp_val
        Ordered_car_List.remove(temp_val)
    i+=1
If I use cars.apply(lambda..., how can I change the Ordered_car_List in each iteration?
Is there another approach that I can take?
I found this page, and it made me want to be faster. The Lambda approach is in the middle when it comes to speed, but it still is so much faster than what I am doing now.
https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
Updating cars
We can vectorize this based on two counters:
cumcount() to cumulatively count each unique value in cars['order_number']
collections.Counter() to count each unique value in Ordered_car_List
from collections import Counter
cumcount = cars.groupby('order_number').cumcount().add(1)
maxcount = cars['order_number'].map(Counter(Ordered_car_List))
# order_number cumcount maxcount
# 0 6 1 1
# 1 6 2 1
# 2 7 1 0
# 3 6 3 1
# 4 7 2 0
# 5 9 1 2
# 6 9 2 2
# 7 10 1 1
# 8 12 1 0
So then we only want to keep cars['order_number'] where cumcount <= maxcount:
either use DataFrame.loc[]
cars.loc[cumcount <= maxcount, 'nodup'] = cars['order_number']
or Series.where()
cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
or Series.mask() with the condition inverted
cars['nodup'] = cars['order_number'].mask(cumcount > maxcount)
Updating Ordered_car_List
The final Ordered_car_List is a Counter() difference:
Used_car_List = cars.loc[cumcount <= maxcount, 'order_number']
# [6, 9, 9, 10]
Ordered_car_List = list((Counter(Ordered_car_List) - Counter(Used_car_List)).elements())
# [28]
Final output
cumcount = cars.groupby('order_number').cumcount().add(1)
maxcount = cars['order_number'].map(Counter(Ordered_car_List))
cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
# x y order_number nodup
# 0 1 1 6 6.0
# 1 1 2 6 NaN
# 2 1 3 7 NaN
# 3 1 4 6 NaN
# 4 1 5 7 NaN
# 5 2 1 9 9.0
# 6 2 2 9 9.0
# 7 2 3 10 10.0
# 8 2 4 12 NaN
Used_car_List = cars.loc[cumcount <= maxcount, 'order_number']
Ordered_car_List = list((Counter(Ordered_car_List) - Counter(Used_car_List)).elements())
# [28]
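For reference, here is a self-contained sketch that puts the two counters together on the sample data. The names wanted, keep and remaining are only illustrative, and the fillna(0) is a defensive assumption in case an order number is missing from the counter.

```python
from collections import Counter

import pandas as pd

cars = pd.DataFrame({'x': [1, 1, 1, 1, 1, 2, 2, 2, 2],
                     'y': [1, 2, 3, 4, 5, 1, 2, 3, 4],
                     'order_number': [6, 6, 7, 6, 7, 9, 9, 10, 12]})
Ordered_car_List = [6, 9, 9, 10, 28]

wanted = Counter(Ordered_car_List)

# n-th occurrence of each order_number, counted from the top
cumcount = cars.groupby('order_number').cumcount().add(1)
# how many cars of that order_number were ordered (0 if none)
maxcount = cars['order_number'].map(wanted).fillna(0)

keep = cumcount <= maxcount
cars['nodup'] = cars['order_number'].where(keep)

# orders that could not be matched to any car, multiplicities preserved
used = Counter(cars.loc[keep, 'order_number'].tolist())
remaining = list((wanted - used).elements())

print(cars)
print(remaining)
```

Using elements() rather than list() on the counter difference keeps repeated leftovers, for example two unmatched orders of the same model.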
Timings
Note that your loop is still very fast with small data, but the vectorized counter approach scales much better.
I have a problem calculating variance with "hidden" NULL (zero) values. Usually that wouldn't be a problem, because a NULL value is not a value, but in my case it is essential to include those NULLs as zeros in the variance calculation. So I have a DataFrame that looks like this:
TableA:
A X Y
1 1 30
1 2 20
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
Then I need to get variance for each different X value and I do this:
TableA.groupby(['X']).agg({'Y':'var'})
But the answer is not what I need, since the variance calculation should also include a NULL (zero) Y value for X=3 when A=1 and A=3.
What my dataset should look like to get the needed variance results:
A X Y
1 1 30
1 2 20
1 3 0
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
3 3 0
So I need variance to take into account that every X should have 1,2 and 3 and when there are no values for Y in certain X number it should be 0. Could you help me in this? How should I change my TableA dataframe to be able to do this or is there another way?
Desired output for TableA should be like this:
X Y
1 75.000000
2 75.000000
3 133.333333
Compute the variance directly, but divide by the number of different possibilities for A:
import numpy as np
# three in your example. adjust as needed
a_choices = len(TableA['A'].unique())
def variance_with_missing(vals):
    # mean over all a_choices slots, counting missing ones as 0
    mean_with_missing = np.sum(vals) / a_choices
    ss_present = np.sum((vals - mean_with_missing)**2)
    # each missing slot contributes (0 - mean)**2
    ss_missing = (a_choices - len(vals)) * mean_with_missing**2
    return (ss_present + ss_missing) / (a_choices - 1)
TableA.groupby(['X']).agg({'Y': variance_with_missing})
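Written out as a formula, the function above appears to compute the following. The symbols are my own notation: $n$ stands for a_choices, $y_1, \dots, y_k$ are the Y values actually present for a given X, and the remaining $n - k$ values are taken to be 0.

```latex
\bar{y} = \frac{1}{n} \sum_{i=1}^{k} y_i ,
\qquad
s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{k} \left( y_i - \bar{y} \right)^2 + (n-k)\,\bar{y}^{\,2} \right]
```

For X=3 in the example, $n = 3$, $k = 1$ and $y_1 = 20$, so $\bar{y} = 20/3$ and $s^2 = \left[ (40/3)^2 + 2\,(20/3)^2 \right] / 2 \approx 133.33$, which matches the desired output.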
The solution below appends a row with Y=0 for every (A, X) combination that does not exist yet. It is a little messy, but I hope it helps.
import numpy as np
import pandas as pd
TableA = pd.DataFrame({'A':[1,1,2,2,2,3,3],
'X':[1,2,1,2,3,1,2],
'Y':[30,20,15,20,20,30,35]})
TableA['A'] = TableA['A'].astype(int)
#### Create row with non existing sequence and fill with 0 ####
for i in range(1,TableA.X.max()+1):
    for j in TableA.A.unique():
        if TableA[(TableA.X==i) & (TableA.A==j)].empty:
            # DataFrame.append was removed in pandas 2.0, so use pd.concat
            TableA = pd.concat([TableA, pd.DataFrame({'A':[j],'X':[i],'Y':[0]})], ignore_index=True)
TableA.groupby('X').agg({'Y':'var'})
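Another possible sketch, assuming each (A, X) pair occurs at most once: pivot the table so that missing combinations show up as NaN, fill them with 0, and take the column-wise variance. The names wide and result are only illustrative.

```python
import pandas as pd

TableA = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3, 3],
                       'X': [1, 2, 1, 2, 3, 1, 2],
                       'Y': [30, 20, 15, 20, 20, 30, 35]})

# Rows are A, columns are X; absent (A, X) pairs become NaN, then 0
wide = TableA.pivot(index='A', columns='X', values='Y').fillna(0)

# Sample variance (ddof=1) of each X column, zeros included
result = wide.var()
print(result)
```

This avoids the Python-level double loop, at the cost of building the full A-by-X grid in memory.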
My dataframe has a daily price column and a window size column:
df = pd.DataFrame(columns = ['price', 'window'],
data = [[100, 1],[120, 2], [115, 2], [116, 2], [100, 4]])
df
price window
0 100 1
1 120 2
2 115 2
3 116 2
4 100 4
I would like to compute the rolling mean of price for each row using the window of the window column.
The result would be this:
df
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75
I don't find any elegant way to do it with apply and I refuse to loop over each row of my DataFrame...
The best solutions, in terms of raw speed and complexity, are based on ideas from the summed-area table. The problem can be considered as a one-dimensional table. Below you can find several approaches, ranked from best to worst.
Numpy + Linear complexity
size = len(df['price'])
price = np.zeros(size + 1)
price[1:] = df['price'].values.cumsum()
window = np.clip(np.arange(size) - (df['window'].values - 1), 0, None)
df['rolling_mean_price'] = (price[1:] - price[window]) / df['window'].values
print(df)
Output
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75
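The prefix-sum idea can be wrapped in a small helper and checked against a plain slice-and-average loop. This is only a sketch: the function name variable_window_mean and the variable naive are made up for illustration.

```python
import numpy as np
import pandas as pd


def variable_window_mean(price, window):
    """Mean of the last window[i] prices ending at row i, via prefix sums."""
    price = np.asarray(price, dtype=float)
    window = np.asarray(window, dtype=int)
    n = len(price)
    # prefix[k] holds the sum of the first k prices, with prefix[0] == 0
    prefix = np.concatenate(([0.0], price.cumsum()))
    # first row of each window, clipped so it never goes before row 0
    start = np.clip(np.arange(n) - window + 1, 0, None)
    # like the code above, this divides by the nominal window size,
    # even if the window was clipped at the top of the frame
    return (prefix[1:] - prefix[start]) / window


df = pd.DataFrame({'price': [100, 120, 115, 116, 100],
                   'window': [1, 2, 2, 2, 4]})
df['rolling_mean_price'] = variable_window_mean(df['price'], df['window'])

# Reference values from explicit slicing (quadratic in the worst case)
naive = [df['price'].iloc[i - w + 1:i + 1].mean()
         for i, w in enumerate(df['window'])]
print(df)
```

Each row costs one subtraction and one division regardless of its window size, which is where the linear complexity comes from.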
Loopy + Linear complexity
price = df['price'].values.cumsum()
df['rolling_mean_price'] = [(price[i] - float((i - w) > -1) * price[i-w]) / w for i, w in enumerate(df['window'])]
Loopy + Quadratic complexity
price = df['price'].values
df['rolling_mean_price'] = [price[i - (w - 1):i + 1].mean() for i, w in enumerate(df['window'])]
I would not recommend this approach using pandas.DataFrame.apply() (reasons described here), but if you insist on it, here is one solution:
df['rolling_mean_price'] = df.apply(
    lambda row: df['price'].rolling(int(row['window'])).mean().iloc[row.name], axis=1)
The output looks like this:
>>> print(df)
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75
I have the following data:
df = pd.DataFrame({'sound': ['A', 'B', 'B', 'A', 'B', 'A'],
'score': [10, 5, 6, 7, 11, 1]})
print(df)
sound score
0 A 10
1 B 5
2 B 6
3 A 7
4 B 11
5 A 1
If I standardize (i.e. Z score) the score variable, I get the following values. The mean of the new z column is basically 0, with SD of 1, both of which are expected for a standardized variable:
df['z'] = (df['score'] - df['score'].mean())/df['score'].std()
print(df)
print('Mean: {}'.format(df['z'].mean()))
print('SD: {}'.format(df['z'].std()))
sound score z
0 A 10 0.922139
1 B 5 -0.461069
2 B 6 -0.184428
3 A 7 0.092214
4 B 11 1.198781
5 A 1 -1.567636
Mean: -7.401486830834377e-17
SD: 1.0
However, what I'm actually interested in is calculating Z scores based on group membership (sound). For example, if a score is from sound A, then convert that value to a Z score using the mean and SD of sound A values only. Likewise, sound B Z scores will only use mean and SD from sound B. This will obviously produce different values compared to regular Z score calculation:
df['zg'] = df.groupby('sound')['score'].transform(lambda x: (x - x.mean()) / x.std())
print(df)
print('Mean: {}'.format(df['zg'].mean()))
print('SD: {}'.format(df['zg'].std()))
sound score z zg
0 A 10 0.922139 0.872872
1 B 5 -0.461069 -0.725866
2 B 6 -0.184428 -0.414781
3 A 7 0.092214 0.218218
4 B 11 1.198781 1.140647
5 A 1 -1.567636 -1.091089
Mean: 3.700743415417188e-17
SD: 0.894427190999916
My question is: why is the mean of the group-based standardized values (zg) also basically equal to 0? Is this expected behaviour or is there an error in my calculation somewhere?
The z scores make sense because standardizing within a variable essentially forces the mean to 0. But the zg values are calculated using different means and SDs for each sound group, so I'm not sure why the mean of that new variable has also been set to 0.
The only situation where I can see this happening is if the sum of values > 0 is equal to sum of values < 0, which when averaged would cancel out to 0. This happens in a regular Z score calculation but I'm surprised that this also happens when operating across multiple groups like this...
I think it makes perfect sense. If E[abc | def] is the expectation of abc given def, then in df['zg']:
m1 = E['zg' | sound = 'A'] = (0.872872 + 0.218218 -1.091089)/3 ~ 0
m2 = E['zg' | sound = 'B'] = (-0.725866 - 0.414781 + 1.140647)/3 ~ 0
and
E['zg'] = (m1+m2)/2 = (0.872872 + 0.218218 -1.091089 -0.725866 - 0.414781 + 1.140647)/6 ~ 0
Yes, this is expected behavior.
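In fancy words, using the Law of Iterated Expectations,
$$E[X] = E\big[\,E[X \mid Y]\,\big]$$
And specifically, if groups Y are finite and thus countable,
$$E[X] = \sum_{j \in G} E[X \mid Y_j]\, P(Y_j)$$
where $P(Y_j)$ is the share of observations that belong to group $Y_j$.
However, by construction, every E[X|Y_j] is 0 for all values of Y in your set G of possible groups.
Thus, the total average will also be zero.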
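A quick numerical check of this argument on the data from the question might look as follows. The names group_means, weights, overall and expected_sd are illustrative. The closed form for the SD is my own addition: it assumes every group is standardized with the sample SD (ddof=1), in which case the squared z scores sum to N - G.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sound': ['A', 'B', 'B', 'A', 'B', 'A'],
                   'score': [10, 5, 6, 7, 11, 1]})

# Group-wise z score: subtract the group mean, divide by the group SD
grouped = df.groupby('sound')['score']
df['zg'] = (df['score'] - grouped.transform('mean')) / grouped.transform('std')

# E[zg | sound] for each group, and the share of rows in each group
group_means = df.groupby('sound')['zg'].mean()
weights = df['sound'].value_counts(normalize=True)

# Weighted sum of the conditional means, i.e. the overall mean
overall = (group_means * weights).sum()

# With N rows and G groups, the overall sample variance is (N - G) / (N - 1)
N, G = len(df), df['sound'].nunique()
expected_sd = np.sqrt((N - G) / (N - 1))

print(overall, df['zg'].std(), expected_sd)
```

This also accounts for the SD of about 0.894 seen in the question: with N = 6 and G = 2 it equals sqrt(4/5).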
I have a very large pandas dataset, and at some point I need to use the following function
def proc_trader(data):
    data['_seq'] = np.nan
    # mark every ending of a roundtrip with its index
    # (.ix was removed in pandas 1.0; .loc does the same job here)
    data.loc[data.cumq == 0,'tag'] = np.arange(1, (data.cumq == 0).sum() + 1)
    # backfill the roundtrip index until previous roundtrip;
    # then fill the rest with 0s (roundtrip incomplete for most recent trades)
    data['_seq'] = data['tag'].bfill().fillna(0)
    return data['_seq']
# btw, why on earth this function returns a dataframe instead of the series `data['_seq']`??
and I use apply
reshaped['_spell']=reshaped.groupby(['trader','stock'])[['cumq']].apply(proc_trader)
Obviously, I cannot share the data here, but do you see a bottleneck in my code? Could it be the arange thing? There are many name-productid combinations in the data.
Minimal Working Example:
import pandas as pd
import numpy as np
reshaped= pd.DataFrame({'trader' : ['a','a','a','a','a','a','a'],'stock' : ['a','a','a','a','a','a','b'], 'day' :[0,1,2,4,5,10,1],'delta':[10,-10,15,-10,-5,5,0] ,'out': [1,1,2,2,2,0,1]})
reshaped.sort_values(by=['trader', 'stock','day'], inplace=True)
reshaped['cumq']=reshaped.groupby(['trader', 'stock']).delta.transform('cumsum')
reshaped['_spell']=reshaped.groupby(['trader','stock'])[['cumq']].apply(proc_trader).reset_index()['_seq']
Nothing really fancy here, just tweaked in a couple of places. There is really no need to put this in a function, so I didn't. On this tiny sample data, it's about twice as fast as the original.
reshaped.sort_values(by=['trader', 'stock','day'], inplace=True)
reshaped['cumq']=reshaped.groupby(['trader', 'stock']).delta.cumsum()
reshaped.loc[ reshaped.cumq == 0, '_spell' ] = 1
reshaped['_spell'] = reshaped.groupby(['trader','stock'])['_spell'].cumsum()
reshaped['_spell'] = reshaped.groupby(['trader','stock'])['_spell'].bfill().fillna(0)
Result:
day delta out stock trader cumq _spell
0 0 10 1 a a 10 1.0
1 1 -10 1 a a 0 1.0
2 2 15 2 a a 15 2.0
3 4 -10 2 a a 5 2.0
4 5 -5 2 a a 0 2.0
5 10 5 0 a a 5 0.0
6 1 0 1 b a 0 1.0
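If a reusable version is preferred after all, the same steps could be sketched as a helper that leaves the input frame untouched. The function name spell_index and its keys parameter are hypothetical, and the sample frame is the one from the question.

```python
import pandas as pd


def spell_index(frame, keys=('trader', 'stock')):
    """Number each completed roundtrip per group; open positions get 0."""
    keys = list(keys)
    out = frame.sort_values(keys + ['day']).copy()
    out['cumq'] = out.groupby(keys)['delta'].cumsum()
    # a roundtrip closes whenever the running position returns to 0
    closes = out['cumq'].eq(0)
    group_cols = [out[k] for k in keys]
    # running count of closes per group, kept only on the closing rows
    number = closes.astype(int).groupby(group_cols).cumsum().where(closes)
    # backfill within each group; trailing open trades become 0
    out['_spell'] = number.groupby(group_cols).bfill().fillna(0)
    return out


reshaped = pd.DataFrame({'trader': ['a'] * 7,
                         'stock': ['a', 'a', 'a', 'a', 'a', 'a', 'b'],
                         'day': [0, 1, 2, 4, 5, 10, 1],
                         'delta': [10, -10, 15, -10, -5, 5, 0],
                         'out': [1, 1, 2, 2, 2, 0, 1]})
result = spell_index(reshaped)
print(result)
```

Everything here runs as grouped cumulative operations, so there is no per-group Python function call as there is with apply.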