Comparing two columns of the dataframe against a new dataframe column - python

I have two dataframes that look like this:
team points
0 1 2.5
1 2 3.2
2 5 5.8
3 3 2.8
4 4 1.9
and:
team1 team2
0 1 5
1 2 4
2 3 1
The expected output is a new column with the winner (the team with more points):
team1 team2 winner
1 5 5
2 4 2
3 1 3

To avoid applymap, you can use lookup and reshape:
# Requires pandas < 2.0 (DataFrame.lookup was removed in 2.0)
x = (df.set_index('team')
       .lookup(df2.values.ravel('F'), ["points"] * df2.size)
       .reshape(df2.shape, order='F')
       .argmax(1))
df2['winner'] = df2.lookup(df2.index, df2.columns[x])
team1 team2 winner
0 1 5 5
1 2 4 2
2 3 1 3
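For pandas versions that no longer ship DataFrame.lookup, the same idea can be sketched with plain NumPy indexing. This is a minimal sketch, not the answerer's code: it assumes the frames are named df and df2 as in this answer and that every team in df2 appears in df. The helper names points, teams, pts and best are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"team": [1, 2, 5, 3, 4],
                   "points": [2.5, 3.2, 5.8, 2.8, 1.9]})
df2 = pd.DataFrame({"team1": [1, 2, 3], "team2": [5, 4, 1]})

points = df.set_index("team")["points"]          # team id -> points
teams = df2[["team1", "team2"]].to_numpy()       # shape (n_matches, 2)

# Look up the points of every team, keeping the (n_matches, 2) layout
pts = points.loc[teams.ravel()].to_numpy().reshape(teams.shape)

# Column position (0 or 1) of the higher score in each row
best = pts.argmax(axis=1)

# Pick the team id at that position, row by row
df2["winner"] = teams[np.arange(len(teams)), best]
print(df2)
```

On a tie, argmax returns the first position, so team1 would be reported as the winner.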

Here is a way using applymap, df.idxmax() and df.lookup:
s=df2.applymap(df1.set_index('team')['points'].get).idxmax(1)
Or a better alternative, courtesy of @user3483203:
s=df2.stack().map(df1.set_index('team')['points']).unstack().idxmax(1)
#s.tolist() gives ['team2', 'team1', 'team1']
df2['winner']=df2.lookup(s.index,s)
print(df2)
team1 team2 winner
0 1 5 5
1 2 4 2
2 3 1 3

My "simple" solution:
import numpy as np
# Replace each team id with its points, then compare the two columns
df3 = df2.replace(df1.set_index("team").points.to_dict())
df2["winner"] = np.where(df3.team1 >= df3.team2, df2.team1, df2.team2)

Alternative solution using only pandas.Series.map, DataFrame.stack and DataFrame.unstack (mapping the maximum back to a team assumes that every team has a distinct points value):
df_match['winner']=( df_match.stack()
.map(df.set_index('team')['points'])
.unstack()
.max(axis=1)
.map(df.set_index('points')['team']) )
print(df_match)
team1 team2 winner
0 1 5 5
1 2 4 2
2 3 1 3

Related

Walking average based on two matching columns

I have a dataframe df of the following format:
team1 team2 score1 score2
0 1 2 1 0
1 3 4 3 0
2 1 3 1 1
3 2 4 0 2
4 1 2 3 2
What I want to do is create a new column that returns the rolling average of the score1 column over the last 3 games, but only for games where both team1 and team2 match.
Expected output:
team1 team2 score1 score2 new
0 1 2 1 0 1
1 3 4 3 0 3
2 1 3 1 1 1
3 2 4 0 2 0
4 1 2 3 2 2
I was able to calculate the rolling average over all games for each team separately like this:
df['new'] = df.groupby('team1')['score1'].transform(lambda x: x.rolling(3, min_periods=1).mean())
but I cannot find a sensible way to extend that to match two teams.
I tried the code below, which returns... something, but definitely not what I need.
df['new'] = df.groupby(['team1','team2'])['score1'].transform(lambda x: x.rolling(3, min_periods=1).mean())
I suppose this could be done with apply(), but I want to avoid it due to performance issues.
I'm not sure what your exact expected output is, but you can first reshape the DataFrame to a long format:
(pd.wide_to_long(df.reset_index(), ['team', 'score'], i='index', j='x')
.groupby('team')['score']
.rolling(3, min_periods=1).mean()
)
Output:
team index x
1 0 1 1.0
2 1 1.0
2 3 1 0.0
0 2 0.0
3 1 1 3.0
2 2 2.0
4 1 2 0.0
3 2 1.0
Name: score, dtype: float64
The workaround I found was to create a 'temp' column that combines the values of 'team1' and 'team2' and to use that column as the key for the rolling average.
# Cast to str first: the team columns hold integers
df['temp'] = df.team1.astype(str) + '_' + df.team2.astype(str)
df['new'] = df.groupby('temp')['score1'].transform(lambda x: x.rolling(3, min_periods=1).mean())
Can this be done in one line?
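On the sample frame from the question, grouping by both team columns directly appears to give the expected values in a single statement, with no helper column. This is a sketch that assumes a fixture such as 1 vs 2 should not be pooled with its reverse, 2 vs 1; if it should, the grouping key would need to be order-insensitive.

```python
import pandas as pd

df = pd.DataFrame({
    "team1": [1, 3, 1, 2, 1],
    "team2": [2, 4, 3, 4, 2],
    "score1": [1, 3, 1, 0, 3],
    "score2": [0, 0, 1, 2, 2],
})

# Group on both team columns at once; transform keeps the original row order
df["new"] = (
    df.groupby(["team1", "team2"])["score1"]
      .transform(lambda x: x.rolling(3, min_periods=1).mean())
)
print(df)
```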

How could I replace null value In a group?

I created this dataframe and calculated the price gap I was looking for, but some flats have the same price, so I get a price difference of 0. How can I replace the value 0 with the difference to the next lower price in the same group?
For example:
neighborhood:a, bed:1, bath:1, price:5
neighborhood:a, bed:1, bath:1, price:5
neighborhood:a, bed:1, bath:1, price:3
neighborhood:a, bed:1, bath:1, price:2
I get price differences of 0, 2, 1, NaN and I'm looking for 2, 2, 1, NaN (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
data=[
[1,'a',1,1,5],[2,'a',1,1,5],[3,'a',1,1,4],[4,'a',1,1,2],[5,'b',1,2,6],[6,'b',1,2,6],[7,'b',1,2,3]
]
df = pd.DataFrame(data, columns = ['id','neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
    .sort_values('price', ascending=False)
    # group on 'neighborhoodname': the frame has no 'city' column
    .groupby(['neighborhoodname', 'beds', 'baths'])['price'].diff(-1))
I think you can first remove duplicates over all the columns used in the groupby plus price, compute the difference as a new column on the filtered data, and finally merge it back into the original with a left join:
df1 = (df.dropna()
.sort_values('price',ascending=False)
.drop_duplicates(['neighborhoodname','beds','baths', 'price']))
df1['difference_price'] = df1.groupby(['neighborhoodname','beds','baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname','beds','baths','price', 'difference_price']], how='left')
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Or you can use a lambda function that replaces the 0 values with NaN and back fills them within each group, which avoids wrong output for one-row groups (values carried over from another group):
df['difference_price'] = (df.sort_values('price',ascending=False)
.groupby(['neighborhoodname','beds','baths'])['price']
.apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
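On recent pandas versions, groupby(...).apply may return a result indexed by the group keys, which no longer aligns with the original rows on assignment. A hedged variant of the back-fill approach that swaps in transform, which returns a result indexed like its input; the keys variable is only there for readability:

```python
import numpy as np
import pandas as pd

data = [
    [1, 'a', 1, 1, 5], [2, 'a', 1, 1, 5], [3, 'a', 1, 1, 4], [4, 'a', 1, 1, 2],
    [5, 'b', 1, 2, 6], [6, 'b', 1, 2, 6], [7, 'b', 1, 2, 3],
]
df = pd.DataFrame(data, columns=['id', 'neighborhoodname', 'beds', 'baths', 'price'])

keys = ['neighborhoodname', 'beds', 'baths']

# Within each group (sorted by descending price): difference to the next
# lower row, then turn 0 (equal prices) into NaN and back fill it
df['difference_price'] = (
    df.sort_values('price', ascending=False)
      .groupby(keys)['price']
      .transform(lambda x: x.diff(-1).replace(0, np.nan).bfill())
)
print(df)
```

The assignment aligns on the index, so the sort order used for the calculation does not change the row order of df.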

Pandas join DataFrame and Series over a column

I have a Pandas DataFrame df that stores a mapping between a label and an integer, and a Pandas Series s that contains a sequence of labels:
print(df)
label id
0 AAAAAAAAA 0
1 BBBBBBBBB 1
2 CCCCCCCCC 2
3 DDDDDDDDD 3
4 EEEEEEEEE 4
print(s)
0 AAAAAAAAA
1 BBBBBBBBB
2 CCCCCCCCC
3 CCCCCCCCC
4 EEEEEEEEE
5 EEEEEEEEE
6 DDDDDDDDD
I want to join this DataFrame and this Series to get the sequence of integers corresponding to my sequence s.
Here is the expected result for my example:
print(df.join(s)["id"])
0 0
1 1
2 2
3 2
4 4
5 4
6 3
Use Series.map with Series:
print (s.map(df.set_index('label')['id']))
0 0
1 1
2 2
3 2
4 4
5 4
6 3
Name: a, dtype: int64
Alternative using a dict. Be careful: if label contains duplicates, no error is raised and the id of the last duplicate row is used:
print (s.map(dict(zip(df['label'], df['id']))))
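A self-contained sketch of both variants on the data from the question, plus a small illustration of the duplicate caveat. The dupes frame and the names by_series, by_dict and dupe_lookup are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "label": ["AAAAAAAAA", "BBBBBBBBB", "CCCCCCCCC", "DDDDDDDDD", "EEEEEEEEE"],
    "id": [0, 1, 2, 3, 4],
})
s = pd.Series(
    ["AAAAAAAAA", "BBBBBBBBB", "CCCCCCCCC", "CCCCCCCCC",
     "EEEEEEEEE", "EEEEEEEEE", "DDDDDDDDD"],
    name="a",
)

# Variant 1: map through a Series indexed by label
by_series = s.map(df.set_index("label")["id"])

# Variant 2: map through a plain dict
by_dict = s.map(dict(zip(df["label"], df["id"])))

# Duplicate labels: building the dict silently keeps the last id
dupes = pd.DataFrame({"label": ["AAAAAAAAA", "AAAAAAAAA"], "id": [0, 9]})
dupe_lookup = dict(zip(dupes["label"], dupes["id"]))
print(by_series.tolist(), dupe_lookup)
```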

Pandas: Create a new column by comparing 2 columns in 2 different data frames

I've 2 data frames in pandas.
in_degree:
Target in_degree
0 2 1
1 4 24
2 5 53
3 6 98
4 7 34
out_degree
Source out_degree
0 1 4
1 2 4
2 3 5
3 4 5
4 5 5
By comparing the two key columns (Source and Target), I'd like to create a new data frame that sums "in_degree" and "out_degree" for matching values and displays the result.
The sample output should look like this:
Source/Target out_degree
0 1 4
1 2 5
2 3 5
3 4 29
4 5 58
Any help would be appreciated.
Thanks.
Traditionally, this would need a merge, but I think you can take advantage of pandas' index-aligned arithmetic to do this a bit faster (here df1 is the in_degree frame and df2 is the out_degree frame).
x = df2.set_index('Source')
y = df1.set_index('Target').rename_axis('Source')
y.columns = x.columns
x.add(y.reindex(x.index), fill_value=0).reset_index()
Source out_degree
0 1 4.0
1 2 5.0
2 3 5.0
3 4 29.0
4 5 58.0
The "traditional" SQL way of solving this would be using merge:
v = df1.merge(df2, left_on='Target', right_on='Source', how='right')
dct = dict(
Source=v['Source'],
out_degree=v['in_degree'].add(v['out_degree'], fill_value=0))
pd.DataFrame(dct).sort_values('Source')
Source out_degree
3 1 4.0
0 2 5.0
4 3 5.0
1 4 29.0
2 5 58.0
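Both outputs above are floats because the missing in_degree values pass through NaN. A minimal sketch of a third option using Series.map, with the missing values filled before casting back to integers. The names df_in, df_out, incoming and result are illustrative and not from the answer.

```python
import pandas as pd

df_in = pd.DataFrame({"Target": [2, 4, 5, 6, 7],
                      "in_degree": [1, 24, 53, 98, 34]})
df_out = pd.DataFrame({"Source": [1, 2, 3, 4, 5],
                       "out_degree": [4, 4, 5, 5, 5]})

# in_degree for each Source; nodes that are never a Target count as 0
incoming = df_out["Source"].map(df_in.set_index("Target")["in_degree"]).fillna(0)

# Add it to out_degree and restore the integer dtype
result = df_out.assign(out_degree=(df_out["out_degree"] + incoming).astype(int))
print(result)
```

This keeps the row order of the out_degree frame, so no sort is needed afterwards.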

Pandas - create total column based on other column

I'm trying to create a total column that sums the numbers from another column based on a third column. I can do this by using .groupby(), but that creates a truncated column, whereas I want a column that is the same length.
My code:
df = pd.DataFrame({'a':[1,2,2,3,3,3], 'b':[1,2,3,4,5,6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 15.0
3 3 4 NaN
4 3 5 NaN
5 3 6 NaN
My desired result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 5.0
3 3 4 15.0
4 3 5 15.0
5 3 6 15.0
...where all rows sharing the same value in 'a' have the same total.
Aggregating a groupby with sum produces one row per unique group key, so the result is shorter than the original frame. Use transform to produce a column of the same length ("like-indexed") as the original data frame without performing any merges.
df['total'] = df.groupby('a')['b'].transform('sum')
>>> df
a b total
0 1 1 1
1 2 2 5
2 2 3 5
3 3 4 15
4 3 5 15
5 3 6 15
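A short sketch contrasting the aggregated result with the like-indexed one on the frame from the question. The per_group and via_map names are illustrative; mapping the group sums back through 'a' is shown only as an equivalent alternative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 3, 3, 3], 'b': [1, 2, 3, 4, 5, 6]})

# Aggregation: one value per unique 'a' (3 rows), indexed by 'a'
per_group = df.groupby('a')['b'].sum()

# transform: the group sum broadcast back to every original row (6 rows)
df['total'] = df.groupby('a')['b'].transform('sum')

# Equivalent here: look up each row's group sum by its 'a' value
via_map = df['a'].map(per_group)
print(df)
```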
