Pandas select value in one dataframe, based on value in other df - python

df1:
         GAME  PLAY  BET
0  (SWE, FIN)  DRAW   10
1  (DEN, GER)   WIN   20
2  (RUS, CZE)  LOSS   30
df2:
         GAME   WIN  DRAW   LOSS
0  (SWE, FIN)  1.50   2.0   3.25
1  (DEN, GER)  2.00   2.5   2.10
2  (RUS, CZE)  1.05   2.1  10.00
I'd like to create a column "PAYOFF" in df1, for each game. The payoff is computed by fetching the actual odds (WIN/DRAW/LOSS) from df2 and multiplying that value with the "BET" from df1. For instance, for the first row, (SWE, FIN), the PLAY was a DRAW, so I need to use that value to fetch from the DRAW column in df2.
I can manage that by joining the two dataframes and then doing some ugly del/rename of columns in a number of steps, but surely I'm missing a more elegant way to do this? TIA, --Tommy
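For reference, here is a minimal reproduction of the two frames shown above (a sketch; it assumes GAME holds plain Python tuples, as the display suggests):

import pandas as pd

df1 = pd.DataFrame({
    'GAME': [('SWE', 'FIN'), ('DEN', 'GER'), ('RUS', 'CZE')],
    'PLAY': ['DRAW', 'WIN', 'LOSS'],
    'BET':  [10, 20, 30],
})
df2 = pd.DataFrame({
    'GAME': [('SWE', 'FIN'), ('DEN', 'GER'), ('RUS', 'CZE')],
    'WIN':  [1.50, 2.00, 1.05],
    'DRAW': [2.0, 2.5, 2.1],
    'LOSS': [3.25, 2.10, 10.00],
})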

I think you need lookup:
df1['New'] = df2.set_index('GAME').lookup(df1.GAME, df1.PLAY)
df1
Out[26]:
        GAME  PLAY  BET   New
0  (SWE,FIN)  DRAW   10   2.0
1  (DEN,GER)   WIN   20   2.0
2  (RUS,CZE)  LOSS   30  10.0
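Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On recent versions, a sketch of the same idea is to melt df2 into long form and merge on GAME and PLAY (ODDS is just an illustrative column name):

long_odds = df2.melt(id_vars='GAME', var_name='PLAY', value_name='ODDS')
out = df1.merge(long_odds, on=['GAME', 'PLAY'], how='left')
out['PAYOFF'] = out['BET'] * out['ODDS']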

I like Wen's solution better, but you can use:
merged = pd.merge(
    pd.concat([df1, pd.get_dummies(df1.PLAY)], axis=1),
    df2,
    on='GAME')
>>> merged.BET * (merged.DRAW_x * merged.DRAW_y + merged.WIN_x * merged.WIN_y + merged.LOSS_x * merged.LOSS_y)
0 20.0
1 40.0
2 300.0
dtype: float64
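In this merge, the one-hot columns created by get_dummies zero out the two outcomes that were not played, so only the matching odds survive the sum. To attach the result to df1 as the requested PAYOFF column (a sketch; it assumes the merge preserves df1's row order, which pd.merge does here since it keeps the order of the left keys):

df1['PAYOFF'] = (merged.BET * (merged.DRAW_x * merged.DRAW_y
                               + merged.WIN_x * merged.WIN_y
                               + merged.LOSS_x * merged.LOSS_y)).values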

Related

Add a column to the original pandas data frame after grouping by 2 columns and taking dot product of two other columns

I have the following data frame in pandas:
         Date Issuer  Weights  Price
0  2019-11-12      A      0.4    100
1  2019-15-12      B      0.5    100
2  2019-11-12      A      0.2    200
3  2019-15-12      B      0.3    100
4  2019-11-12      A      0.4    100
5  2019-15-12      B      0.2    100
I want to add an "Avg Price" column to the original data frame, after grouping by (Date, Issuer) and taking the dot product of Weights and Price, so that it looks like the Output table below.
Is there a way to do it without using merge or join? What would be the simplest way to do it?
One way, using pandas.DataFrame.prod:
# row-wise product of Weights and Price
df["Avg Price"] = df[["Weights", "Price"]].prod(axis=1)
# sum those products within each (Date, Issuer) group and broadcast back
df["Avg Price"] = df.groupby(["Date", "Issuer"])["Avg Price"].transform("sum")
print(df)
Output:
         Date Issuer  Weights  Price  Avg Price
0  2019-11-12      A      0.4    100      120.0
1  2019-15-12      B      0.5    100      100.0
2  2019-11-12      A      0.2    200      120.0
3  2019-15-12      B      0.3    100      100.0
4  2019-11-12      A      0.4    100      120.0
5  2019-15-12      B      0.2    100      100.0
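An equivalent one-liner along the same lines (a sketch): compute the row-wise product once and let transform broadcast the per-group sum back, with no intermediate column:

df["Avg Price"] = (df["Weights"] * df["Price"]).groupby(
    [df["Date"], df["Issuer"]]).transform("sum")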

Count row entries to adjust different column, Pandas

I'm trying to update column entries by counting the frequency of row entries in different columns. Here is a sample of my data. The actual data consists of 10k samples, each having a length of 220 (220 seconds).
import pandas as pd

d = {'ID': ['a12', 'a12', 'a12', 'a12', 'a12', 'a12', 'a12', 'a12',
            'v55', 'v55', 'v55', 'v55', 'v55', 'v55', 'v55', 'v55'],
     'Exp_A': [0.012, 0.154, 0.257, 0.665, 1.072, 1.514, 1.871, 2.144,
               0.467, 0.812, 1.59, 2.151, 2.68, 3.013, 3.514, 4.015],
     'freq': ['00:00:00', '00:00:01', '00:00:02', '00:00:03', '00:00:04',
              '00:00:05', '00:00:06', '00:00:07',
              '00:00:00', '00:00:01', '00:00:02', '00:00:03', '00:00:04',
              '00:00:05', '00:00:06', '00:00:07'],
     'A_Bullseye': [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
     'A_Bull_Total': [0, 0, 0, 0, 0, 1, 1, 2, 0, 0, 0, 1, 1, 1, 1, 2],
     'A_Shot': [0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0]}
df = pd.DataFrame(data=d)
Per second, only a Bullseye or a Shot can be registered.
Count1: the number of rows with df.A_Shot == 1 before the first df.A_Bullseye == 1 for each ID; this is 3 for ID='a12' and 2 for ID='v55'.
Count2: the number of rows with df.A_Shot == 1 from the end of Count1 up to the second df.A_Bullseye == 1; this is 1 for df[df.ID=='a12'] and 2 for df[df.ID=='v55'].
The i in Count(i) runs up to df.groupby(by='ID')['A_Bull_Total'].max(), which here is 2.
So, if I can compute the average count for each i, I will be able to adjust the values of df.Exp_A using the average of the above counts.
mask_A_Shot = df.A_Shot == 1
mask_A_Bullseye = df.A_Bullseye == 0
mask = mask_A_Shot & mask_A_Bullseye
df[mask.groupby(df['ID'])].mean()
Ideally I would like to have, for each i (bullseye), how many shots were needed and how many seconds it took.
Create a grouping key of bullseye number within each ID using cumsum; then you can find how many shots were taken and how much time elapsed between bullseyes.
import pandas as pd

df['freq'] = pd.to_timedelta(df.freq, unit='s')
df['Bullseye'] = df.groupby('ID').A_Bullseye.cumsum() + 1

# Chop off any shots after the final bullseye
m = df.Bullseye <= df.groupby('ID').A_Bullseye.transform(lambda x: x.cumsum().max())

df[m].groupby(['ID', 'Bullseye']).agg({'A_Shot': 'sum',
                                       'freq': lambda x: x.max() - x.min()})
Output:
              A_Shot     freq
ID  Bullseye
a12 1              3 00:00:03
    2              1 00:00:01
v55 1              2 00:00:01
    2              2 00:00:03
Edit:
Given your comment, here is how I would proceed. We shift A_Bullseye before taking the cumulative sum, so the counter increments on the row after each bullseye rather than at the bullseye itself, and we modify A_Shot so that bullseyes are also counted as shots.
df['freq'] = pd.to_timedelta(df.freq, unit='s')
df['Bullseye'] = df.groupby('ID').A_Bullseye.apply(lambda x: x.shift().cumsum().fillna(0) + 1)

# Also consider bullseyes as shots:
df.loc[df.A_Bullseye == 1, 'A_Shot'] = 1

# Chop off any shots after the final bullseye
m = df.Bullseye <= df.groupby('ID').A_Bullseye.transform(lambda x: x.cumsum().max())

df1 = (df[m].groupby(['ID', 'Bullseye'])
            .agg({'A_Shot': 'sum',
                  'freq': lambda x: (x.max() - x.min()).total_seconds()}))
Output (df1):
              A_Shot  freq
ID  Bullseye
a12 1.0            4   4.0
    2.0            2   1.0
v55 1.0            3   2.0
    2.0            3   3.0
And now that freq is a plain number of seconds, you can do divisions easily:
df1.A_Shot / df1.freq
#ID   Bullseye
#a12  1.0         1.0
#     2.0         2.0
#v55  1.0         1.5
#     2.0         1.0
#dtype: float64
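From here, the per-i averages the question asks about (the average number of shots and seconds per bullseye number i, across IDs) are just a mean over the Bullseye level of the index. A sketch, assuming df1 from the block above:

avg_per_i = df1.groupby(level='Bullseye').mean()
print(avg_per_i)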

Create multiple dataframe columns containing calculated values from an existing column

I have a dataframe, sega_df:
Month      2016-11-01  2016-12-01
Character
Sonic            12.0         3.0
Shadow            5.0        23.0
I would like to create multiple new columns by applying a formula to each already existing column within my dataframe (in short, pretty much doubling the number of columns). That formula is (100 - [5 * eachcell]) * 0.2.
For example, for Sonic in November, (100 - [5 * 12.0]) * 0.2 = 8.0, and for Sonic in December, (100 - [5 * 3.0]) * 0.2 = 17.0. My ideal output is:
Month      2016-11-01  2016-12-01  Weighted_2016-11-01  Weighted_2016-12-01
Character
Sonic            12.0         3.0                  8.0                 17.0
Shadow            5.0        23.0                 15.0                 -3.0
I know how to create a for loop to create one column. This is what I would do if only one month were in consideration:
for w in range(1, len(sega_df.index)):
    sega_df['Weighted'] = (100 - 5*sega_df)*0.2
    sega_df[sega_df < 0] = 0
I don't yet have the skills or experience to create multiple columns at once. I've looked for other questions that might cover exactly what I am doing, but haven't gotten anything to work yet. Thanks in advance.
One vectorised approach is to drop down to NumPy:
A = sega_df.values
A = (100 - 5*A) * 0.2
res = pd.DataFrame(A, index=sega_df.index, columns=('Weighted_'+sega_df.columns))
Then join the result to your original dataframe:
sega_df = sega_df.join(res)
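A pure-pandas variant of the same idea (a sketch, assuming the Month column labels are strings, as the answer above already does): apply the formula to the whole frame at once and prefix the new column names before joining:

weighted = (100 - 5 * sega_df) * 0.2
sega_df = sega_df.join(weighted.add_prefix('Weighted_'))

The sega_df[sega_df < 0] = 0 step from the question is deliberately left out, because the desired output keeps the negative value -3.0; add .clip(lower=0) to weighted if you do want the clipping.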

python pandas: get rolling value of one DataFrame by the rolling index of another DataFrame

I have two dataframes: one has multi levels of columns, and another has only single level column (which is the first level of the first dataframe, or say the second dataframe is calculated by grouping the first dataframe).
These two dataframes look like the following:
first dataframe-df1
second dataframe-df2
The relationship between df1 and df2 is:
df2 = df1.groupby(axis=1, level='sector').mean()
Then, I get the index of rolling_max of df1 by:
result1 = pd.rolling_apply(df1, window=5, func=lambda x: pd.Series(x).idxmax(), min_periods=4)
Let me explain result1 a little bit. For example, during the five days (window length) 2016/2/23 - 2016/2/29, the max price of the stock sh600870 happened in 2016/2/24, the index of 2016/2/24 in the five-day range is 1. So, in result1, the value of stock sh600870 in 2016/2/29 is 1.
Now, I want to get the sector price for each stock by the index in result1.
Let's take the same stock as an example: the stock sh600870 is in the sector '家用电器视听器材白色家电' (household appliances, audio-visual equipment, white goods). So for 2016/2/29, I want to get the sector price on 2016/2/24, which is 8.770.
How can I do that?
idxmax (or np.argmax) returns an index which is relative to the rolling
window. To make the index relative to df1, add the index of the left edge of
the rolling window:
index = pd.rolling_apply(df1, window=5, min_periods=4, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=5, min_periods=4)
index = index.add(shift, axis=0)
Once you have ordinal indices relative to df1, you can use them to index
into df1 or df2 using .iloc.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 15
columns = pd.MultiIndex.from_product([['foo','bar'], ['A','B']])
columns.names = ['sector', 'stock']
dates = pd.date_range('2016-02-01', periods=N, freq='D')
df1 = pd.DataFrame(np.random.randint(10, size=(N, 4)), columns=columns, index=dates)
df2 = df1.groupby(axis=1, level='sector').mean()
window_size, min_periods = 5, 4
index = pd.rolling_apply(df1, window=window_size, min_periods=min_periods, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=window_size, min_periods=min_periods)
# alternative, you could use
# shift = np.pad(np.arange(len(df1)-window_size+1), (window_size-1, 0), mode='constant')
# but this is harder to read/understand, and therefore it maybe more prone to bugs.
index = index.add(shift, axis=0)
result = pd.DataFrame(index=df1.index, columns=df1.columns)
for col in index:
    sector, stock = col
    mask = pd.notnull(index[col])
    idx = index.loc[mask, col].astype(int)
    result.loc[mask, col] = df2[sector].iloc[idx].values
print(result)
yields
sector      foo       bar
stock         A    B    A    B
2016-02-01  NaN  NaN  NaN  NaN
2016-02-02  NaN  NaN  NaN  NaN
2016-02-03  NaN  NaN  NaN  NaN
2016-02-04  5.5    5    5  7.5
2016-02-05  5.5    5    5  8.5
2016-02-06  5.5  6.5    5  8.5
2016-02-07  5.5  6.5    5  8.5
2016-02-08  6.5  6.5    5  8.5
2016-02-09  6.5  6.5  6.5  8.5
2016-02-10  6.5  6.5  6.5    6
2016-02-11    6  6.5  4.5    6
2016-02-12    6  6.5  4.5    4
2016-02-13    2  6.5  4.5    5
2016-02-14    4  6.5  4.5    5
2016-02-15    4  6.5    4  3.5
Note that in pandas 0.18 the rolling_apply syntax changed: DataFrames and Series now have a rolling method, so you would now use:
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax)
shift = (pd.Series(np.arange(len(df1)))
.rolling(window=window_size, min_periods=min_periods).min())
index = index.add(shift.values, axis=0)
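On current pandas (1.x/2.x) the same lines still work; passing raw=True to apply hands plain NumPy arrays to np.argmax, which avoids building a Series per window and is usually faster (a sketch of the only line that changes):

index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax, raw=True)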

Efficient operation over grouped dataframe Pandas

I have a very big pandas dataframe where I need an ordering within groups based on another column. I know how to iterate over groups, do an operation on each group and union all those groups back into one dataframe, but this is slow and I feel there is a better way to achieve it. Here is the input and what I want out of it. Input:
ID   price
1   100.00
1    80.00
1    90.00
2    40.00
2    40.00
2    50.00
Output:
ID   price  order
1   100.00      3
1    80.00      1
1    90.00      2
2    40.00      1
2    40.00      2   (could be 1, doesn't matter too much)
2    50.00      3
Since this is over about 5 million records with around 250,000 IDs, efficiency is important.
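For reference, a copy-paste reproduction of the sample input above (a sketch; the dtypes are an assumption):

import pandas as pd

df = pd.DataFrame({
    'ID':    [1, 1, 1, 2, 2, 2],
    'price': [100.00, 80.00, 90.00, 40.00, 40.00, 50.00],
})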
If speed is what you want, then the following should be pretty good, although it is a bit more complicated as it makes use of complex-number sorting in numpy. This is similar to the approach used (by me) when writing the aggregate-sort method in the numpy-groupies package.
import numpy as np

# get global sort order, for sorting by ID then price
full_idx = np.argsort(df['ID'] + 1j*df['price'])
# get min of full_idx for each ID (note that there are multiple ways of doing this)
n_for_id = np.bincount(df['ID'])
first_of_idx = np.cumsum(n_for_id) - n_for_id
# subtract first_of_idx from full_idx
rank = np.empty(len(df), dtype=int)
rank[full_idx] = np.arange(len(df)) - first_of_idx[df['ID'][full_idx]]
df['rank'] = rank + 1
It takes 2s for 5m rows on my machine, which is about 100x faster than using groupby.rank from pandas (although I didn't actually run the pandas version with 5m rows because it would take too long; I'm not sure how #ayhan managed to do it in only 30s, perhaps a difference in pandas versions?).
If you do use this, then I recommend testing it thoroughly, as I have not.
You can use rank:
df["order"] = df.groupby("ID")["price"].rank(method="first")
df
Out[47]:
   ID  price  order
0   1  100.0    3.0
1   1   80.0    1.0
2   1   90.0    2.0
3   2   40.0    1.0
4   2   40.0    2.0
5   2   50.0    3.0
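rank returns floats; if you want the integer order labels shown in the desired output, cast the result (a sketch):

df["order"] = df.groupby("ID")["price"].rank(method="first").astype(int)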
It takes about 30s on a dataset of 5m rows with 250,000 IDs (i5-3330):
df = pd.DataFrame({"price": np.random.rand(5000000), "ID": np.random.choice(np.arange(250000), size = 5000000)})
%time df["order"] = df.groupby("ID")["price"].rank(method="first")
Wall time: 36.3 s
