Create new pandas columns from multiple columns - python

Here is the dataframe:
   MatchId  EventCodeId  EventCode         Team1        Team2         Team1_Goals  Team2_Goals  xG_Team1            xG_Team2            CurrentPlaytime
0  865314   1029         Goal Home         Northampton  Crawley Town  2            2            2.067663207769023   0.8130662505484256  457040
1  865314   1029         Goal Home         Northampton  Crawley Town  2            2            2.067663207769023   0.8130662505484256  1405394
2  865314   2053         Goal Away         Northampton  Crawley Town  2            2            2.067663207769023   0.8130662505484256  1898705
3  865314   2053         Goal Away         Northampton  Crawley Town  2            2            2.067663207769023   0.8130662505484256  4388278
4  865314   1029         Goal Home         Northampton  Crawley Town  2            2            2.067663207769023   0.8130662505484256  4507898
5  865314   1030         Cancel Goal Home  Northampton  Crawley Town  2            2            2.067663207769023   0.8130662505484256  4517728
6  865314   1029         Goal Home         Northampton  Crawley Town  2            2            2.067663207769023   0.8130662505484256  4956346
7  865314   1030         Cancel Goal Home  Northampton  Crawley Town  2            2            2.067663207769023   0.8130662505484256  4960633
8  865316   2053         Goal Away         Coventry     Bradford      0            0            1.0847662440468118  1.2526705617472387  447858
9  865316   2054         Cancel Goal Away  Coventry     Bradford      0            0            1.0847662440468118  1.2526705617472387  456361
The new columns should be created as follows:
for EventCodeId = 1029 and EventCode = Goal Home
new_col1 = CurrentPlaytime/(3*10**4)
for EventCodeId = 2053 and EventCode = Goal Away
new_col2 = CurrentPlaytime/(3*10**4)
For every other EventCodeId and EventCode, new_col1 and new_col2 will take 0.
Here is how I started, but I couldn't get any further. Please help.
new_col1 = []
new_col2 = []
def timeslot(EventCodeId, EventCode, CurrentPlaytime):
    if x == 1029 and y == 'Goal Home':
        new_col1.append(z/(3*10**4))
    elif x == 2053 and y == 'Goal Away':
        new_col2.append(z/(3*10**4))
    else:
        new_col1.append(0)
        new_col2.append(0)
    return new_col1
    return new_col2
df1['new_col1', 'new_col2'] = df1.apply(lambda x,y,z: timeslot(x['EventCodeId'], y['EventCode'], z['CurrentPlaytime']), axis=1)
TypeError: ("<lambda>() missing 2 required positional arguments: 'y' and 'z'", 'occurred at index 0')

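The TypeError occurs because apply with axis=1 passes a single argument (the row) to the function, so a lambda that expects x, y and z cannot be called.
You do not need an explicit loop. Use vectorised operations where possible.
Using numpy.where:
import numpy as np
s = df1['CurrentPlaytime'] / (3 * 10**4)
mask1 = (df1['EventCodeId'] == 1029) & (df1['EventCode'] == 'Goal Home')
mask2 = (df1['EventCodeId'] == 2053) & (df1['EventCode'] == 'Goal Away')
df1['new_col1'] = np.where(mask1, s, 0)
df1['new_col2'] = np.where(mask2, s, 0)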
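To check the numpy.where approach end to end, here is a minimal sketch on three rows taken from the question's table. It assumes the intended formula is CurrentPlaytime / (3 * 10**4), as in the asker's own function; the names slot, home_goal and away_goal are illustrative only.

```python
import numpy as np
import pandas as pd

# Three rows from the question's table, reduced to the relevant columns.
df1 = pd.DataFrame({
    "EventCodeId": [1029, 2053, 1030],
    "EventCode": ["Goal Home", "Goal Away", "Cancel Goal Home"],
    "CurrentPlaytime": [457040, 1898705, 4517728],
})

# Assumed formula: playtime divided by 30000 (taken from the asker's function).
slot = df1["CurrentPlaytime"] / (3 * 10**4)

# Boolean masks: both the id and the label must match.
home_goal = (df1["EventCodeId"] == 1029) & (df1["EventCode"] == "Goal Home")
away_goal = (df1["EventCodeId"] == 2053) & (df1["EventCode"] == "Goal Away")

# Keep the slot value where the mask is True, otherwise 0.
df1["new_col1"] = np.where(home_goal, slot, 0)
df1["new_col2"] = np.where(away_goal, slot, 0)

print(df1)
```

Rows that match neither condition, such as the cancelled goal, get 0 in both columns.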

Related

Column Mapping using Python

I have two dataframes. The first one has 1000 rows and looks like:
Date tri23_1 hsgç_T2 bbbj-1Y_jn Family Bonus
2011-06-09 qwer 1 rits Laavin 456
2011-07-09 ww 43 mayo Grendy 679
2011-09-10 wwer 44 ramya Fantol 431
2011-11-02 5 sam Gondow 569
The second dataframe contains all the unique values and the hotels associated with these values:
Group Hotel
tri23_1 Jamel
hsgç_T2 Frank
bbbj-1Y_jn Luxy
mlkl_781 Grand Hotel
vchs_94 Vancouver
My goal is to replace the column names of the first dataframe with the corresponding values of the Hotel column of the second dataframe. The output should look like this:
Date Jamel Frank Luxy Family Bonus
2011-06-09 qwer 1 rits Laavin 456
2011-07-09 ww 43 mayo Grendy 679
2011-09-10 wwer 44 ramya Fantol 431
2011-11-02 5 sam Gondow 569
Can I achieve this using Python?
You could try this, using to_dict():
df1.columns=[df2.set_index('Group').to_dict()['Hotel'][i] if i in df2.set_index('Group').to_dict()['Hotel'].keys() else i for i in df1.columns]
print(df1)
Output:
df1
Date tri23_1 hsgç_T2 bbbj-1Y_jn Family Bonus
0 2011-06-09 qwer 1 rits Laavin 456.0
1 2011-07-09 ww 43 mayo Grendy 679.0
2 2011-09-10 wwer 44 ramya Fantol 431.0
3 2011-11-02 5 sam Gondow 569
df2
Group Hotel
0 tri23_1 Jamel
1 hsgç_T2 Frank
2 bbbj-1Y_jn Luxy
3 mlkl_781 Grand Hotel
4 vchs_94 Vancouver
df1 changed
Date Jamel Frank Luxy Family Bonus
0 2011-06-09 qwer 1 rits Laavin 456.0
1 2011-07-09 ww 43 mayo Grendy 679.0
2 2011-09-10 wwer 44 ramya Fantol 431.0
3 2011-11-02 5 sam Gondow 569
Update: Explanation
First, if df2['Group'] isn't the index of df2, we set it as index.
Then pass the dataframe to a dict:
df2.set_index('Group').to_dict()
>>>{'Hotel': {'tri23_1': 'Jamel', 'hsgç_T2': 'Frank', 'bbbj-1Y_jn': 'Luxy', 'mlkl_781': 'Grand Hotel', 'vchs_94': 'Vancouver'}}
Then we select the value of key 'Hotel'
df2.set_index('Group').to_dict()['Hotel']
>>>{'tri23_1': 'Jamel', 'hsgç_T2': 'Frank', 'bbbj-1Y_jn': 'Luxy', 'mlkl_781': 'Grand Hotel', 'vchs_94': 'Vancouver'}
Then, column by column, we look up the column name in that dictionary; if the column doesn't exist in the dictionary's keys, we just return the same value, e.g. Date, Family, Bonus:
i='Date'
i in df2.set_index('Group').to_dict()['Hotel'].keys() --->False
return 'Date'
...
i='tri23_1'
i in df2.set_index('Group').to_dict()['Hotel'].keys() --->True
return df2.set_index('Group').to_dict()['Hotel']['tri23_1']
...
...
#And so on...
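The same lookup can probably be written more compactly with DataFrame.rename, which leaves any column without an entry in the mapping unchanged. This is a sketch of an alternative, not the answer's original method; the two sample rows are copied from the question and the names mapping and renamed are illustrative.

```python
import pandas as pd

# Two sample rows from the question's first dataframe.
df1 = pd.DataFrame({
    "Date": ["2011-06-09", "2011-07-09"],
    "tri23_1": ["qwer", "ww"],
    "hsgç_T2": [1, 43],
    "bbbj-1Y_jn": ["rits", "mayo"],
    "Family": ["Laavin", "Grendy"],
    "Bonus": [456, 679],
})

# The lookup table from the question.
df2 = pd.DataFrame({
    "Group": ["tri23_1", "hsgç_T2", "bbbj-1Y_jn", "mlkl_781", "vchs_94"],
    "Hotel": ["Jamel", "Frank", "Luxy", "Grand Hotel", "Vancouver"],
})

# Build the Group -> Hotel dictionary once instead of once per column.
mapping = dict(zip(df2["Group"], df2["Hotel"]))

# rename only touches columns that appear as keys in the mapping.
renamed = df1.rename(columns=mapping)

print(list(renamed.columns))
```

Building the dictionary once also avoids calling set_index and to_dict repeatedly inside the list comprehension.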

Create two columns based on a function with apply()

I have a dataset containing football data of the premier league as such:
HomeTeam AwayTeam FTHG FTAG
0 Liverpool Norwich 4 1
1 West Ham Man City 0 5
2 Bournemouth Sheffield United 1 1
3 Burnley Southampton 3 0
... ... ... ... ...
where "FTHG" and "FTAG" are full-time home team goals and away team goals.
I need to write a function that calculates the final Premier League table given the results (in the form of a data frame). What I wrote is this function:
def calcScore(row):
    if PL_df.iloc[row]['FTHG'] > PL_df.iloc[row]['FTAG']:
        x = 3
        y = 0
    elif PL_df.iloc[row]['FTHG'] < PL_df.iloc[row]['FTAG']:
        x = 0
        y = 3
    elif PL_df.iloc[row]['FTHG'] == PL_df.iloc[row]['FTAG']:
        x = 1
        y = 1
    return x,y
This works; for example, for the first row it gives this output:
in[1]: calcScore(0)
out[1]: (3,0)
Now I need to create two columns, HP and AP, that contain the number of points awarded to the home and away teams respectively, using apply(). But I can't think of a way to do that.
I hope I was clear enough. Thank you in advance.
No need for a function (this is also faster than apply):
import numpy as np
win_or_draws = df['FTHG'] > df['FTAG'], df['FTHG'] == df['FTAG']
df['HP'] = np.select(win_or_draws, (3, 1), 0)
df['AP'] = np.select(win_or_draws, (0, 1), 3)
Output:
HomeTeam AwayTeam FTHG FTAG HP AP
0 Liverpool Norwich 4 1 3 0
1 West Ham Man City 0 5 0 3
2 Bournemouth Sheffield United 1 1 1 1
3 Burnley Southampton 3 0 3 0
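Since the question asks specifically for apply(), here is a hedged sketch of that route for comparison. It assumes the scoring function is rewritten to take a row rather than a row number (calc_score is a hypothetical rename of the asker's calcScore), and it uses result_type="expand" so that the returned tuple becomes two columns.

```python
import pandas as pd

# The four sample rows from the question.
PL_df = pd.DataFrame({
    "HomeTeam": ["Liverpool", "West Ham", "Bournemouth", "Burnley"],
    "AwayTeam": ["Norwich", "Man City", "Sheffield United", "Southampton"],
    "FTHG": [4, 0, 1, 3],
    "FTAG": [1, 5, 1, 0],
})

def calc_score(row):
    """Return (home points, away points) for one match row."""
    if row["FTHG"] > row["FTAG"]:
        return 3, 0
    if row["FTHG"] < row["FTAG"]:
        return 0, 3
    return 1, 1

# axis=1 passes each row to calc_score; "expand" turns each tuple into columns 0 and 1.
points = PL_df.apply(calc_score, axis=1, result_type="expand")
points.columns = ["HP", "AP"]

# Attach the two new columns by index.
PL_df = PL_df.join(points)

print(PL_df)
```

On large tables the vectorised np.select version is likely to be noticeably faster, because apply with axis=1 calls the Python function once per row.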

pandas - how to extract top three rows from the dataframe provided

My pandas dataframe df produces the result below:
grouped = df[(df['X'] == 'venture') & (df['company_code'].isin(['TDS','XYZ','UVW']))].groupby(['company_code','sector'])['X_sector'].count()
The output of this is as follows:
company_code sector
TDS Meta 404
Electrical 333
Mechanical 533
Agri 453
XYZ Sports 331
Electrical 354
Movies 375
Manufacturing 355
UVW Sports 505
Robotics 345
Movies 56
Health 3263
Manufacturing 456
Others 524
Name: X_sector, dtype: int64
What I want to get is the top three sectors within the company codes.
What is the way to do it?
You will have to chain a groupby here. Consider this example:
import pandas as pd
import numpy as np
np.random.seed(111)
names = [
    'Robert Baratheon',
    'Jon Snow',
    'Daenerys Targaryen',
    'Theon Greyjoy',
    'Tyrion Lannister'
]
df = pd.DataFrame({
    'season': np.random.randint(1, 7, size=100),
    'actor': np.random.choice(names, size=100),
    'appearance': 1
})
s = df.groupby(['season','actor'])['appearance'].count()
print(s.sort_values(ascending=False).groupby('season').head(1)) # <-- head(3) for 3 values
Returns:
season actor
4 Daenerys Targaryen 7
6 Robert Baratheon 6
3 Robert Baratheon 6
5 Jon Snow 5
2 Theon Greyjoy 5
1 Jon Snow 4
where s is (truncated at season 4):
season actor
1 Daenerys Targaryen 2
Jon Snow 4
Robert Baratheon 2
Theon Greyjoy 3
Tyrion Lannister 4
2 Daenerys Targaryen 4
Jon Snow 3
Robert Baratheon 1
Theon Greyjoy 5
Tyrion Lannister 3
3 Daenerys Targaryen 2
Jon Snow 1
Robert Baratheon 6
Theon Greyjoy 3
Tyrion Lannister 3
4 ...
Why make things complicated when simpler code is possible? Using the question's company_code column:
Z = df.groupby('company_code')['sector'].value_counts().groupby(level=0).head(3).sort_values(ascending=False).to_frame('counts').reset_index()
Z
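Applied to the counts shown in the question, the sort-then-head idea might look like the sketch below. The Series is rebuilt by hand from the question's printed output, so the construction step is an assumption about how the real grouped object is shaped; top3 is an illustrative name.

```python
import pandas as pd

# Rebuild the grouped counts printed in the question (company_code, sector) -> count.
index = pd.MultiIndex.from_tuples(
    [
        ("TDS", "Meta"), ("TDS", "Electrical"), ("TDS", "Mechanical"), ("TDS", "Agri"),
        ("XYZ", "Sports"), ("XYZ", "Electrical"), ("XYZ", "Movies"), ("XYZ", "Manufacturing"),
        ("UVW", "Sports"), ("UVW", "Robotics"), ("UVW", "Movies"), ("UVW", "Health"),
        ("UVW", "Manufacturing"), ("UVW", "Others"),
    ],
    names=["company_code", "sector"],
)
grouped = pd.Series(
    [404, 333, 533, 453, 331, 354, 375, 355, 505, 345, 56, 3263, 456, 524],
    index=index,
    name="X_sector",
)

# Sort all counts descending, then keep the first three rows of each company.
top3 = grouped.sort_values(ascending=False).groupby(level="company_code").head(3)

print(top3)
```

Because head(3) keeps the first three rows per group in their current order, sorting beforehand is what makes those rows the three largest.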

Create column with multiple data frames and multiple conditions

I am looking at football data and trying to add an opponent column, but am struggling with the way that the data frames are organized.
****EDIT****
defense = {'week': [1, 1, 1, 1, 2, 2, 2, 2], 'team': ['GB', 'MIA', 'CHI', 'DET', 'GB', 'MIA', 'CHI', 'DET']}
games = {'week': [1, 1, 2, 2], 'winner': ['GB', 'MIA', 'GB', 'DET'], 'loser': ['CHI', 'DET', 'MIA', 'CHI']}
def_df = pd.DataFrame(data=defense)
games_df = pd.DataFrame(data=games)
def_df
team week
0 GB 1
1 MIA 1
2 CHI 1
3 DET 1
4 GB 2
5 MIA 2
6 CHI 2
7 DET 2
games_df
loser week winner
0 CHI 1 GB
1 DET 1 MIA
2 MIA 2 GB
3 CHI 2 DET
I am looking to add an 'Opponent' column to def_df based on that week.
team week Opponent
0 GB 1 CHI
1 MIA 1 DET
2 CHI 1 GB
3 DET 1 MIA
4 GB 2 MIA
5 MIA 2 GB
6 CHI 2 DET
7 DET 2 CHI
Thanks!
Here's one way using a nested dictionary mapping:
from collections import defaultdict
d = defaultdict(dict)
for row in games_df.itertuples(index=False):
    d[row.week].update({row.winner: row.loser, row.loser: row.winner})
def_df['opponent'] = def_df.apply(lambda x: d[x['week']][x['team']], axis=1)
print(def_df)
team week opponent
0 GB 1 CHI
1 MIA 1 DET
2 CHI 1 GB
3 DET 1 MIA
4 GB 2 MIA
5 MIA 2 GB
6 CHI 2 DET
7 DET 2 CHI
An equally valid alternative using tuple keys, which avoids collections:
d = {}
for row in games_df.itertuples(index=False):
    d[(row.week, row.winner)] = row.loser
    d[(row.week, row.loser)] = row.winner
def_df['opponent'] = def_df.set_index(['week', 'team']).index.map(d.get)
Updated
Create a column of opponents:
opponent_list = []
for team, week in zip(def_df['team'], def_df['week']):
    for gameweek, winner, loser in zip(games_df['week'], games_df['winner'], games_df['loser']):
        if gameweek == week and (winner == team or loser == team):
            if winner == team:
                opponent_list.append(loser)
            else:
                opponent_list.append(winner)
def_df['opponent'] = opponent_list
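A loop-free variant is also possible with a merge. This is a sketch of a different technique from the answers above: each game is listed twice, once from each team's point of view, and the result is joined to def_df on week and team. The names schedule and result are illustrative.

```python
import pandas as pd

# The sample data from the question.
def_df = pd.DataFrame({
    "week": [1, 1, 1, 1, 2, 2, 2, 2],
    "team": ["GB", "MIA", "CHI", "DET", "GB", "MIA", "CHI", "DET"],
})
games_df = pd.DataFrame({
    "week": [1, 1, 2, 2],
    "winner": ["GB", "MIA", "GB", "DET"],
    "loser": ["CHI", "DET", "MIA", "CHI"],
})

# One row per (week, team): first from the winner's side, then from the loser's side.
schedule = pd.concat(
    [
        games_df.rename(columns={"winner": "team", "loser": "opponent"}),
        games_df.rename(columns={"loser": "team", "winner": "opponent"}),
    ],
    ignore_index=True,
)

# A left merge keeps every row of def_df in its original order.
result = def_df.merge(schedule, on=["week", "team"], how="left")

print(result)
```

A team with no game in a given week would end up with a missing opponent rather than raising an error, which may or may not be the behaviour you want.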

Generate columns of top ranked values in Pandas

I have a dataframe topic_data that contains the output of an LDA topic model:
topic_data.head(15)
topic word score
0 0 Automobile 0.063986
1 0 Vehicle 0.017457
2 0 Horsepower 0.015675
3 0 Engine 0.014857
4 0 Bicycle 0.013919
5 1 Sport 0.032938
6 1 Association_football 0.025324
7 1 Basketball 0.020949
8 1 Baseball 0.016935
9 1 National_Football_League 0.016597
10 2 Japan 0.051454
11 2 Beer 0.032839
12 2 Alcohol 0.027909
13 2 Drink 0.019494
14 2 Vodka 0.017908
This shows the top 5 terms for each topic, and the score (weight) for each. What I'm trying to do is reformat so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings generated from the word and score columns (something along the lines of "%s (%.02f)" % (word,score)). That means the new dataframe should look something like this:
Topic 0 1 ...
Rank
0 Automobile (0.06) Sport (0.03) ...
1 Vehicle (0.02) Association_football (0.03) ...
... ... ... ...
What's the right way of going about this? I assume it involves a combination of index-setting, unstacking, and ranking, but I'm not sure of the right approach.
It would be something like this; note that Rank has to be generated first:
# df is the topic_data dataframe from the question
df['Rank'] = df.groupby('topic')['score'].rank(ascending=False, method='first').astype(int) - 1
df['New_str'] = df['word'] + df['score'].apply(' ({0:.2f})'.format)
df2 = df.sort_values(['Rank', 'score'])[['New_str', 'topic', 'Rank']]
print(df2.pivot(index='Rank', values='New_str', columns='topic'))
topic 0 1 2
Rank
0 Automobile (0.06) Sport (0.03) Japan (0.05)
1 Vehicle (0.02) Association_football (0.03) Beer (0.03)
2 Horsepower (0.02) Basketball (0.02) Alcohol (0.03)
3 Engine (0.01) Baseball (0.02) Drink (0.02)
4 Bicycle (0.01) National_Football_League (0.02) Vodka (0.02)
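For reference, here is a small self-contained sketch of the same reshaping on six rows from the question. It generates Rank differently from the answer, by sorting on score within each topic and numbering the rows with cumcount; the column name label and the variable table are illustrative.

```python
import pandas as pd

# Top three terms of topics 0 and 1, copied from the question.
topic_data = pd.DataFrame({
    "topic": [0, 0, 0, 1, 1, 1],
    "word": ["Automobile", "Vehicle", "Horsepower",
             "Sport", "Association_football", "Basketball"],
    "score": [0.063986, 0.017457, 0.015675, 0.032938, 0.025324, 0.020949],
})

# Sort by descending score within each topic, then number rows 0, 1, 2, ... per topic.
ranked = topic_data.sort_values(["topic", "score"], ascending=[True, False]).copy()
ranked["Rank"] = ranked.groupby("topic").cumcount()

# Build the "word (score)" strings with two decimal places.
ranked["label"] = ranked["word"] + ranked["score"].map(" ({:.2f})".format)

# Ranks become the index, topic ids become the columns.
table = ranked.pivot(index="Rank", columns="topic", values="label")

print(table)
```

cumcount does not depend on the scores being unique, so tied scores still get distinct ranks and the pivot does not fail on duplicate index entries.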
