Walking average based on two matching columns - python

I have a dataframe df of the following format:
   team1  team2  score1  score2
0      1      2       1       0
1      3      4       3       0
2      1      3       1       1
3      2      4       0       2
4      1      2       3       2
What I want to do is create a new column with the rolling average of score1 over the last 3 games, but only over games where the same two teams from team1 and team2 faced each other.
Expected output:
   team1  team2  score1  score2  new
0      1      2       1       0    1
1      3      4       3       0    3
2      1      3       1       1    1
3      2      4       0       2    0
4      1      2       3       2    2
I was able to calculate a rolling average across all games for each team separately like this:
df['new'] = df.groupby('team1')['score1'].transform(lambda x: x.rolling(3, min_periods=1).mean())
but I cannot find a sensible way to extend that to matching on both teams.
I tried the code below that returns... something, but definitely not what I need.
df['new'] = df.groupby(['team1','team2'])['score1'].transform(lambda x: x.rolling(3, min_periods=1).mean())
I suppose this could be done with apply(), but I want to avoid it due to performance issues.

Not sure what your exact expected output is, but you can first reshape the DataFrame to a long format:
(pd.wide_to_long(df.reset_index(), ['team', 'score'], i='index', j='x')
.groupby('team')['score']
.rolling(3, min_periods=1).mean()
)
Output:
team  index  x
1     0      1    1.0
      2      1    1.0
2     3      1    0.0
      0      2    0.0
3     1      1    3.0
      2      2    2.0
4     1      2    0.0
      3      2    1.0
Name: score, dtype: float64
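If the goal is then to attach these per-team rolling means back to the original wide rows, one possible follow-up (a sketch building on the snippet above; the rolling_score1/rolling_score2 column names are made up for illustration) is to drop the team level and unstack x:

import pandas as pd

df = pd.DataFrame({'team1': [1, 3, 1, 2, 1], 'team2': [2, 4, 3, 4, 2],
                   'score1': [1, 3, 1, 0, 3], 'score2': [0, 0, 1, 2, 2]})

# Per-team rolling mean in long form, reshaped back to one column per original
# score column and joined to the wide frame.
long_roll = (pd.wide_to_long(df.reset_index(), ['team', 'score'], i='index', j='x')
               .groupby('team')['score']
               .rolling(3, min_periods=1).mean()
               .droplevel('team')        # keep only the (index, x) levels
               .unstack('x')             # one column per original score column
               .add_prefix('rolling_score'))
print(df.join(long_roll))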

The workaround I've found was to create a 'temp' column that merges the values in 'team1' and 'team2', and to use that column as the grouping key for the rolling average.
df['temp'] = df.team1.astype(str) + '_' + df.team2.astype(str)
df['new'] = df.groupby('temp')['score1'].transform(lambda x: x.rolling(3, min_periods=1).mean())
Can this be done in one line?
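For what it's worth, grouping on both columns directly already gives a one-liner, and if the pair should match regardless of which side is team1 and which is team2, you can normalize the pair first. A minimal sketch under that assumption (the pair frame is just a throwaway helper):

import numpy as np
import pandas as pd

df = pd.DataFrame({'team1': [1, 3, 1, 2, 1], 'team2': [2, 4, 3, 4, 2],
                   'score1': [1, 3, 1, 0, 3], 'score2': [0, 0, 1, 2, 2]})

# Sort each (team1, team2) pair so (1, 2) and (2, 1) count as the same fixture,
# then take a rolling mean of score1 over the last 3 meetings of that pair.
pair = pd.DataFrame(np.sort(df[['team1', 'team2']].to_numpy(), axis=1), index=df.index)
df['new'] = (df.groupby([pair[0], pair[1]])['score1']
               .transform(lambda x: x.rolling(3, min_periods=1).mean()))
print(df)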

Related

Realise accumulated DataFrame from a column of Boolean values

Consider the following python pandas DataFrame:
ID  Holidays  visit_1  visit_2  visit_3  other
0   True      1        2        0        red
0   False     3        2        0        red
0   True      4        4        1        blue
1   False     2        0        0        red
1   True      1        2        1        green
2   False     1        0        0        red
Currently I calculate a new DataFrame with the accumulated visit values as follows.
# Calculate the columns of the total visit count
visit_df = df.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
I would like to create a new one taking into account only the rows whose Holidays value is True. How could I do this?
Simply subset the rows first:
df[df['Holidays']].groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
output:
visit_1 visit_2 visit_3
ID
0 5 6 1
1 1 2 1
An alternative if you also want to get the groups without any match:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
)
output:
visit_1 visit_2 visit_3
ID
0 5.0 6.0 1.0
1 1.0 2.0 1.0
2 0.0 0.0 0.0
A variant:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
.convert_dtypes()
.add_suffix('_Holidays')
)
output:
visit_1_Holidays visit_2_Holidays visit_3_Holidays
ID
0 5 6 1
1 1 2 1
2 0 0 0
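A related sketch, if you prefer to keep the simple boolean subset but still want every ID to show up: reindex the grouped result against the full set of IDs (assuming the sample frame above; the out name is just for illustration).

import pandas as pd

df = pd.DataFrame({'ID':       [0, 0, 0, 1, 1, 2],
                   'Holidays': [True, False, True, False, True, False],
                   'visit_1':  [1, 3, 4, 2, 1, 1],
                   'visit_2':  [2, 2, 4, 0, 2, 0],
                   'visit_3':  [0, 0, 1, 0, 1, 0],
                   'other':    ['red', 'red', 'blue', 'red', 'green', 'red']})

# Sum the visit columns over holiday rows only, then reindex so IDs without any
# True row still appear, filled with 0 instead of being dropped.
out = (df[df['Holidays']]
       .groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
       .reindex(df['ID'].unique(), fill_value=0))
print(out)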

average of specific columns and storing them in new column

What am I doing wrong here? I have a dataframe where I am adding two new columns. The first creates a count by adding all the values in each column to the right that are equal to 1; that part works fine. The next part of the code should give the average of all the values to the right that are not equal to 0, but for some reason it is also taking the values to the left into account. Here is the code. Thanks for any help.
I have tried my code as well as both solutions below and am still getting the wrong average. Here's a simplified version with a random dataframe, and all three versions of the code. I have removed the values to the left and still have the issue of the average being wrong. Maybe this will help.
Version 1:
df = pd.DataFrame(np.random.randint(0,3,size=(10, 10)), columns=list('ABCDEFGHIJ'))
idx_last = len(df.columns)
df.insert(loc=0, column='new', value=df[df[0:(idx_last+1)]==1].sum(axis=1))
idx_last = len(df.columns)
df.insert(loc=1, column='avg', value=df[df[0:(idx_last+1)]!=0].mean(axis=1))
df
Version 2:
df = pd.DataFrame(np.random.randint(0,3,size=(10, 10)), columns=list('ABCDEFGHIJ'))
df.insert(loc=0, column='new', value=(df.iloc[:, 0:]==1).sum(axis=1))
df.insert(loc=1, column='avg', value=(df.iloc[:, 1:]!=0).mean(axis=1))
df
Version 3:
df = pd.DataFrame(np.random.randint(0,3,size=(10, 10)), columns=list('ABCDEFGHIJ'))
idx_last = len(df.columns)
loc_value=0
df.insert(loc=loc_value, column='new', value=df[df[loc_value:(idx_last+1)]==1].sum(axis=1))
idx_last = len(df.columns)
loc_value=1
df.insert(loc=loc_value, column='avg', value=df[df[loc_value: (idx_last+1)]!=0].sum(axis=1))
df
I believe you need DataFrame.iloc to get columns by position. Because a new column has been added, you need position + 1 for the avg column, together with DataFrame.where to replace non-matched values with missing values:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,3,size=(10, 5)), columns=list('ABCDE'))
df.insert(loc=0, column='new', value=(df.iloc[:, 0:]==1).sum(axis=1))
df.insert(loc=1, column='avg', value=(df.iloc[:, 1:].where(df.iloc[:, 1:]!=0)).mean(axis=1))
print (df)
new avg A B C D E
0 1 1.750000 2 1 2 2 0
1 2 1.600000 2 2 1 2 1
2 2 1.500000 2 1 0 1 2
3 2 1.333333 1 0 2 0 1
4 1 1.500000 2 1 0 0 0
5 1 1.666667 0 1 2 0 2
6 2 1.000000 0 0 1 0 1
7 1 1.500000 0 0 0 2 1
8 2 1.600000 1 2 2 2 1
9 1 1.500000 0 0 2 1 0
Or use a helper DataFrame stored in the df1 variable:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,3,size=(10, 5)), columns=list('ABCDE'))
df1 = df.copy()
df.insert(loc=0, column='new', value=(df1==1).sum(axis=1))
df.insert(loc=1, column='avg', value=df1.where(df1!=0).mean(axis=1))
print (df)
new avg A B C D E
0 1 1.750000 2 1 2 2 0
1 2 1.600000 2 2 1 2 1
2 2 1.500000 2 1 0 1 2
3 2 1.333333 1 0 2 0 1
4 1 1.500000 2 1 0 0 0
5 1 1.666667 0 1 2 0 2
6 2 1.000000 0 0 1 0 1
7 1 1.500000 0 0 0 2 1
8 2 1.600000 1 2 2 2 1
9 1 1.500000 0 0 2 1 0
The issue arises with the expression (df.iloc[:, 1:]!=0).mean(axis=1). df.iloc[:, 1:]!=0 returns a matrix of booleans, as it is a comparison expression, and taking the mean of such values will not give the mean of the original values, since the maximum value in such a matrix is 1.
Hence, the following would do the job (note the indexing as well):
df = pd.DataFrame(np.random.randint(0,3,size=(10, 10)), columns=list('ABCDEFGHIJ'))
df.insert(loc=0, column='new', value=(df.iloc[:, 0:]==1).sum(axis=1))
df.insert(loc=1, column='avg', value=(df.iloc[:, 1:]!=0).sum(axis=1)) #just keeping the count of non zeros
df["avg"]=df.iloc[:, 2:].sum(axis=1)/df["avg"]

How could I replace null values in a group?

I created this dataframe and calculated the gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How could I replace the 0 value with the difference to the next lower price in the same group?
for example:
neighboorhood:a, bed:1, bath:1, price:5
neighboorhood:a, bed:1, bath:1, price:5
neighboorhood:a, bed:1, bath:1, price:3
neighboorhood:a, bed:1, bath:1, price:2
I get price differences of 0, 2, 1, NaN and I'm looking for 2, 2, 1, NaN (briefly, I don't want to compare 2 flats with the same price).
Thanks in advance and good day.
data=[
[1,'a',1,1,5],[2,'a',1,1,5],[3,'a',1,1,4],[4,'a',1,1,2],[5,'b',1,2,6],[6,'b',1,2,6],[7,'b',1,2,3]
]
df = pd.DataFrame(data, columns = ['id','neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1))
I think you can first remove duplicates across all of the columns used for the groupby plus price, create the new column in the filtered data, and last use merge with a left join back to the original:
df1 = (df.dropna()
         .sort_values('price', ascending=False)
         .drop_duplicates(['neighborhoodname', 'beds', 'baths', 'price']))
df1['difference_price'] = df1.groupby(['neighborhoodname','beds','baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname','beds','baths','price', 'difference_price']], how='left')
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Or you can use a lambda function that back-fills the 0 values per group, which avoids wrong output for one-row groups (where values would otherwise come from another group):
df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname', 'beds', 'baths'])['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
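Another possible sketch, under the assumption that you want the gap to the next strictly lower price within each group and would rather avoid the merge back (gap_to_next_lower is a made-up helper, not a pandas API):

import numpy as np
import pandas as pd

data = [[1, 'a', 1, 1, 5], [2, 'a', 1, 1, 5], [3, 'a', 1, 1, 4], [4, 'a', 1, 1, 2],
        [5, 'b', 1, 2, 6], [6, 'b', 1, 2, 6], [7, 'b', 1, 2, 3]]
df = pd.DataFrame(data, columns=['id', 'neighborhoodname', 'beds', 'baths', 'price'])

def gap_to_next_lower(s):
    uniq = np.sort(s.unique())[::-1]        # unique prices, descending
    nxt = dict(zip(uniq[:-1], uniq[1:]))    # price -> next lower price in the group
    return s - s.map(nxt)                   # NaN for the lowest price in the group

df['difference_price'] = (df.groupby(['neighborhoodname', 'beds', 'baths'])['price']
                            .transform(gap_to_next_lower))
print(df)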

Comparing two columns of the dataframe against a new dataframe column

I have two dataframes which look like
team points
0 1 2.5
1 2 3.2
2 5 5.8
3 3 2.8
4 4 1.9
and:
team1 team2
0 1 5
1 2 4
2 3 1
Expected output should give me a new column with the winner (more points):
team1 team2 winner
1 5 5
2 4 2
3 1 3
Trying to avoid applymap and use lookup+reshape
x = (df.set_index('team')
       .lookup(df2.values.ravel('F'), ["points"]*df2.size)
       .reshape(df2.shape, order='F')
       .argmax(1))
df2['winner'] = df2.lookup(df2.index, df2.columns[x])
team1 team2 winner
0 1 5 5
1 2 4 2
2 3 1 3
Here is a way using applymap, df.idxmax() and df.lookup:
s=df2.applymap(df1.set_index('team')['points'].get).idxmax(1)
Or a better alternative, courtesy of @user3483203:
s=df2.stack().map(df1.set_index('team')['points']).unstack().idxmax(1)
#s.tolist() gives ['team2', 'team1', 'team1']
df2['winner']=df2.lookup(s.index,s)
print(df2)
team1 team2 winner
0 1 5 5
1 2 4 2
2 3 1 3
My "simple" solution:
df3= df2.replace(df1.set_index("team").points.to_dict())
df2["winner"]= np.where(df3.team1>=df3.team2,df2.team1,df2.team2)
An alternative solution using only pandas.Series.map, DataFrame.stack and DataFrame.unstack:
df_match['winner'] = (df_match.stack()
                              .map(df.set_index('team')['points'])
                              .unstack()
                              .max(axis=1)
                              .map(df.set_index('points')['team']))
print(df_match)
team1 team2 winner
0 1 5 5
1 2 4 2
2 3 1 3
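One caveat: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so the lookup-based answers above need an adjustment on recent versions. A minimal sketch of the same idea using map and numpy.where (df holds the points, df2 the matchups, as in the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'team': [1, 2, 5, 3, 4], 'points': [2.5, 3.2, 5.8, 2.8, 1.9]})
df2 = pd.DataFrame({'team1': [1, 2, 3], 'team2': [5, 4, 1]})

# Map each team to its points, then pick the team with the higher score per row.
pts = df.set_index('team')['points']
p1, p2 = df2['team1'].map(pts), df2['team2'].map(pts)
df2['winner'] = np.where(p1 >= p2, df2['team1'], df2['team2'])
print(df2)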

Slicing a dataframe on column names or alternative column name if they are not available

I am looking for a way to eliminate key errors that are caused by different column names in the data that gets loaded. So for example I might have columns like
dummy_df = pd.DataFrame(np.random.randint(0,5,size=(5, 2)), columns=['Test','Test_v2'])
Test Test_v2
0 0 3
1 0 0
2 1 2
3 4 0
4 4 4
How can I do something like
dummy_df[ if_avail('Test') otherwise 'Test_v2']
It would be nice to be able to pass a list, where it starts checking for existence in item order.
I think you can check the column names and select the first matched one:
L = ['Test_v1','Test','Test_v2']
m = dummy_df.columns.isin(L)
first = dummy_df.columns[m].values[0]
s = dummy_df[first]
print (s)
0 3
1 2
2 3
3 0
4 0
Name: Test, dtype: int32
Another solution is:
print (dummy_df.reindex(columns=L).dropna(axis=1, how='all').iloc[:, 0])
0 3
1 2
2 3
3 0
4 0
Name: Test, dtype: int32
Explanation:
First, reindex by the list of column names:
print (dummy_df.reindex(columns=L))
Test_v1 Test Test_v2
0 NaN 3 2
1 NaN 2 3
2 NaN 3 1
3 NaN 0 0
4 NaN 0 2
And remove all columns with all NaNs:
print (dummy_df.reindex(columns=L).dropna(axis=1, how='all'))
Test Test_v2
0 3 2
1 2 3
2 3 1
3 0 0
4 0 2
And last, select the first column with iloc:
print (dummy_df.reindex(columns=L).dropna(axis=1, how='all').iloc[:, 0])
0 3
1 2
2 3
3 0
4 0
Name: Test, dtype: int32
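If this comes up in several places, a tiny helper that walks the candidate list in order (unlike isin, which follows the frame's own column order) may be convenient. A sketch; first_available is a made-up name, not a pandas API:

import numpy as np
import pandas as pd

def first_available(frame, candidates):
    # Return the first column from candidates that actually exists in frame.
    for col in candidates:
        if col in frame.columns:
            return frame[col]
    raise KeyError(f"none of {candidates} found in {list(frame.columns)}")

dummy_df = pd.DataFrame(np.random.randint(0, 5, size=(5, 2)), columns=['Test', 'Test_v2'])
print(first_available(dummy_df, ['Test_v1', 'Test', 'Test_v2']))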
