Create column with multiple data frames and multiple conditions - python

I am looking at football data and trying to add an opponent column, but am struggling with the way that the data frames are organized.
****EDIT****
defense = {'week': [1, 1, 1, 1, 2, 2, 2, 2], 'team': ['GB', 'MIA', 'CHI', 'DET', 'GB', 'MIA', 'CHI', 'DET']}
games = {'week': [1, 1, 2, 2], 'winner': ['GB', 'MIA', 'GB', 'DET'], 'loser': ['CHI', 'DET', 'MIA', 'CHI']}
def_df = pd.DataFrame(data=defense)
games_df = pd.DataFrame(data=games)
def_df
team week
0 GB 1
1 MIA 1
2 CHI 1
3 DET 1
4 GB 2
5 MIA 2
6 CHI 2
7 DET 2
games_df
loser week winner
0 CHI 1 GB
1 DET 1 MIA
2 MIA 2 GB
3 CHI 2 DET
I am looking to add a defense['Opponent'] column based on that week.
team week Opponent
0 GB 1 CHI
1 MIA 1 DET
2 CHI 1 GB
3 DET 1 MIA
4 GB 2 MIA
5 MIA 2 GB
6 CHI 2 DET
7 DET 2 CHI
Thanks!

Here's one way using a nested dictionary mapping:
from collections import defaultdict
# week -> {team: opponent}
d = defaultdict(dict)
for row in games_df.itertuples(index=False):
    d[row.week].update({row.winner: row.loser, row.loser: row.winner})
def_df['opponent'] = def_df.apply(lambda x: d[x['week']][x['team']], axis=1)
print(def_df)
team week opponent
0 GB 1 CHI
1 MIA 1 DET
2 CHI 1 GB
3 DET 1 MIA
4 GB 2 MIA
5 MIA 2 GB
6 CHI 2 DET
7 DET 2 CHI
An equally valid alternative using tuple keys, which avoids collections:
# (week, team) -> opponent
d = {}
for row in games_df.itertuples(index=False):
    d[(row.week, row.winner)] = row.loser
    d[(row.week, row.loser)] = row.winner
def_df['opponent'] = def_df.set_index(['week', 'team']).index.map(d.get)

Updated
Create a column of opponents
opponent_list = []
for team, week in zip(def_df['team'], def_df['week']):
    for gameweek, winner, loser in zip(games_df['week'], games_df['winner'], games_df['loser']):
        if gameweek == week and (winner == team or loser == team):
            if winner == team:
                opponent_list.append(loser)
            else:
                opponent_list.append(winner)
def_df['opponent'] = opponent_list
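The nested loop above compares every team row against every game row. A merge-based version is one possible alternative; this is a minimal sketch, assuming the same def_df and games_df as in the question. The names lookup and result are introduced here for illustration only.

```python
import pandas as pd

def_df = pd.DataFrame({
    'week': [1, 1, 1, 1, 2, 2, 2, 2],
    'team': ['GB', 'MIA', 'CHI', 'DET', 'GB', 'MIA', 'CHI', 'DET'],
})
games_df = pd.DataFrame({
    'week': [1, 1, 2, 2],
    'winner': ['GB', 'MIA', 'GB', 'DET'],
    'loser': ['CHI', 'DET', 'MIA', 'CHI'],
})

# Stack the games twice so that every team appears once per week in a
# 'team' column, with the other side of the game in 'opponent'.
lookup = pd.concat(
    [
        games_df.rename(columns={'winner': 'team', 'loser': 'opponent'}),
        games_df.rename(columns={'loser': 'team', 'winner': 'opponent'}),
    ],
    ignore_index=True,
)

# A left merge keeps the row order of def_df.
result = def_df.merge(lookup, on=['week', 'team'], how='left')
print(result)
```

A team with no game in a given week would get NaN in the opponent column rather than raising an error.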

Related

create two columns based on a function with apply()

I have a dataset containing football data of the premier league as such:
HomeTeam AwayTeam FTHG FTAG
0 Liverpool Norwich 4 1
1 West Ham Man City 0 5
2 Bournemouth Sheffield United 1 1
3 Burnley Southampton 3 0
... ... ... ... ...
where "FTHG" and "FTAG" are full-time home team goals and away team goals.
I need to write a function that calculates the final Premier League table given the results (in the form of a data frame). What I wrote is this function:
def calcScore(row):
    if PL_df.iloc[row]['FTHG'] > PL_df.iloc[row]['FTAG']:
        x = 3
        y = 0
    elif PL_df.iloc[row]['FTHG'] < PL_df.iloc[row]['FTAG']:
        x = 0
        y = 3
    elif PL_df.iloc[row]['FTHG'] == PL_df.iloc[row]['FTAG']:
        x = 1
        y = 1
    return x,y
This works; for example, for the first row it gives this output:
in[1]: calcScore(0)
out[1]: (3,0)
Now I need to create two columns, HP and AP, that contain the number of points awarded to the home and away teams respectively, using apply(). But I can't think of a way to do that.
I hope I was clear enough. Thank you in advance.
No need for a function (and also faster than apply):
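import numpy as np
# conditions are checked in order: home win first, then draw; anything else is an away win
win_or_draws = df['FTHG'] > df['FTAG'], df['FTHG'] == df['FTAG']
df['HP'] = np.select(win_or_draws, (3, 1), 0)
df['AP'] = np.select(win_or_draws, (0, 1), 3)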
Output:
HomeTeam AwayTeam FTHG FTAG HP AP
0 Liverpool Norwich 4 1 3 0
1 West Ham Man City 0 5 0 3
2 Bournemouth Sheffield United 1 1 1 1
3 Burnley Southampton 3 0 3 0
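Since the question asked specifically about apply(), here is a hedged sketch of that route as well. It assumes the four sample rows from the question; calc_points is a hypothetical helper that takes a row instead of a row number, and result_type='expand' is what spreads the returned pair over two columns.

```python
import pandas as pd

df = pd.DataFrame({
    'HomeTeam': ['Liverpool', 'West Ham', 'Bournemouth', 'Burnley'],
    'AwayTeam': ['Norwich', 'Man City', 'Sheffield United', 'Southampton'],
    'FTHG': [4, 0, 1, 3],
    'FTAG': [1, 5, 1, 0],
})

def calc_points(row):
    """Return (home points, away points) for one match row."""
    if row['FTHG'] > row['FTAG']:
        return 3, 0
    if row['FTHG'] < row['FTAG']:
        return 0, 3
    return 1, 1

# axis=1 passes each row to calc_points; 'expand' turns the returned
# tuple into two columns, which are assigned to HP and AP by position.
df[['HP', 'AP']] = df.apply(calc_points, axis=1, result_type='expand')
print(df)
```

This is easier to read than np.select but calls a Python function once per row, so it will usually be slower on large frames.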

How to check the matching of characters between 2 columns?

I have two columns, and I want to check whether they share 4 or more characters, regardless of where those characters appear in the strings. If they do, a new column should say OK, and KO otherwise.
How can I do this in Python or SQLite?
Example:
DATASET WITH ;
Street 1;Street 2
ASENSIO Y TOLEDO 15;AVILA 9
AVILA 9;AVILA 9
FISTERRA S/N;FINISTERRE S/N - SAN ROQUE
PASEO DEL PUER;PASEO DEL PUERTO SN
PASEO DEL PUER;PASEO DEL PUERTO SN
LA UNION 2;LA UNION 2
ALEGRIA 14;LA UNION 2
Thank you.
https://i.stack.imgur.com/gYLcg.png
Code:
def dataet():
    df_dataset= pd.read_csv("C:/Users/Documents/DATASET2.CSV", sep=';')
    print(df_dataset.columns.values)
    query = """
    SELECT INSTR(street 1, street 2)
    FROM df_dataset
    """
    result= pdsql.sqldf(query)
    print(result)
In Python you can use sets to get the unique characters in each string, then intersect (&) the sets from Street 1 and Street 2 to get the characters they share. I'm also removing spaces from the match, since you presumably don't want to count them.
df['count'] = ['OK' if len(set(x) & set(y) - set(' ')) >= 4 else 'KO' for x, y in zip(df['Street 1'].fillna(''), df['Street 2'].fillna(''))]
print(df)
Output:
Street 1 Street 2 count
0 ASENSIO Y TOLEDO 15 AVILA 9 KO
1 AVILA 9 AVILA 9 OK
2 FISTERRA S/N FINISTERRE S/N - SAN ROQUE OK
3 PASEO DEL PUER PASEO DEL PUERTO SN OK
4 PASEO DEL PUER PASEO DEL PUERTO SN OK
5 LA UNION 2 LA UNION 2 OK
6 ALEGRIA 14 LA UNION 2 KO
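To make the set logic easier to test on single pairs, it can be pulled into a small function. This is only a sketch; share_at_least and its parameter n are names introduced here, not part of the answer above.

```python
def share_at_least(a, b, n=4):
    """True if a and b have at least n distinct characters in common.

    Spaces are ignored, and each character counts once no matter how
    often it occurs.
    """
    common = (set(a) & set(b)) - {' '}
    return len(common) >= n

print(share_at_least('AVILA 9', 'AVILA 9'))
print(share_at_least('ASENSIO Y TOLEDO 15', 'AVILA 9'))
```

Note that this compares distinct characters, not substrings, so two unrelated streets that happen to use the same letters will also count as a match.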
Update: If you're looking for the length of the longest common substring between Street 1 and Street 2:
from difflib import SequenceMatcher
z = df.fillna('')
z['count'] = [len(x[m.a:m.a+m.size].replace(' ', '')) for x, m in
[(x, SequenceMatcher(None, x, y).find_longest_match(0, len(x), 0, len(y)))
for x, y in zip(z['Street 1'], z['Street 2'])]]
z['match'] = ['OK' if x >= 4 else 'KO' for x in z['count']]
print(z)
Output:
Street 1 Street 2 count match
0 ASENSIO Y TOLEDO 15 AVILA 9 1 KO
1 AVILA 9 AVILA 9 6 OK
2 FISTERRA S/N FINISTERRE S/N - SAN ROQUE 6 OK
3 PASEO DEL PUER PASEO DEL PUERTO SN 12 OK
4 PASEO DEL PUER PASEO DEL PUERTO SN 12 OK
5 LA UNION 2 LA UNION 2 8 OK
6 ALEGRIA 14 LA UNION 2 1 KO
7 JARILLO 7 BO IZD SAN AMBROSIO 1 KO
8 STREET AVE PARRA PARRA STREET 4 6 OK
9 PARRA 4 0 KO
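The nested comprehension above is dense, so here is the same idea unpacked for a single pair of strings. It is a sketch under the assumption that both inputs are plain strings; longest_common_len is a hypothetical helper name, and autojunk=False is an added choice to switch off difflib's heuristic for long sequences.

```python
from difflib import SequenceMatcher

def longest_common_len(a, b):
    """Length of the longest common substring of a and b, not counting spaces."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    # m.a is the start of the match in a, m.size its length
    return len(a[m.a:m.a + m.size].replace(' ', ''))

print(longest_common_len('AVILA 9', 'AVILA 9'))
print(longest_common_len('PASEO DEL PUER', 'PASEO DEL PUERTO SN'))
```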
Also using numpy.where():
df['res'] = np.where([len(set(x) & set(y)) >= 4 for x, y in zip(df['Street 1'].fillna(''), df['Street 2'].fillna(''))], 'OK', 'KO')

pandas - how to extract top three rows from the dataframe provided

I produce a result from my pandas DataFrame df with the following code:
grouped = df[(df['X'] == 'venture') & (df['company_code'].isin(['TDS','XYZ','UVW']))].groupby(['company_code','sector'])['X_sector'].count()
The output of this is as follows:
company_code sector
TDS Meta 404
Electrical 333
Mechanical 533
Agri 453
XYZ Sports 331
Electrical 354
Movies 375
Manufacturing 355
UVW Sports 505
Robotics 345
Movies 56
Health 3263
Manufacturing 456
Others 524
Name: X_sector, dtype: int64
What I want to get is the top three sectors within the company codes.
What is the way to do it?
You will have to chain a groupby here. Consider this example:
import pandas as pd
import numpy as np
np.random.seed(111)
names = [
'Robert Baratheon',
'Jon Snow',
'Daenerys Targaryen',
'Theon Greyjoy',
'Tyrion Lannister'
]
df = pd.DataFrame({
'season': np.random.randint(1, 7, size=100),
'actor': np.random.choice(names, size=100),
'appearance': 1
})
s = df.groupby(['season','actor'])['appearance'].count()
print(s.sort_values(ascending=False).groupby('season').head(1)) # <-- head(3) for 3 values
Returns:
season actor
4 Daenerys Targaryen 7
6 Robert Baratheon 6
3 Robert Baratheon 6
5 Jon Snow 5
2 Theon Greyjoy 5
1 Jon Snow 4
Where s is (clipped at 4)
season actor
1 Daenerys Targaryen 2
Jon Snow 4
Robert Baratheon 2
Theon Greyjoy 3
Tyrion Lannister 4
2 Daenerys Targaryen 4
Jon Snow 3
Robert Baratheon 1
Theon Greyjoy 5
Tyrion Lannister 3
3 Daenerys Targaryen 2
Jon Snow 1
Robert Baratheon 6
Theon Greyjoy 3
Tyrion Lannister 3
4 ...
A simpler option is value_counts(), which already sorts each group in descending order, followed by head(3) per group:
Z = df.groupby('company_code')['sector'].value_counts().groupby(level=0).head(3).sort_values(ascending=False).to_frame('counts').reset_index()
Z
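Applied to the Series from the question, the sort-then-head pattern could look like the sketch below. The counts are copied from the question's output; rebuilding grouped by hand is an assumption made so the example runs on its own.

```python
import pandas as pd

counts = {
    ('TDS', 'Meta'): 404, ('TDS', 'Electrical'): 333,
    ('TDS', 'Mechanical'): 533, ('TDS', 'Agri'): 453,
    ('XYZ', 'Sports'): 331, ('XYZ', 'Electrical'): 354,
    ('XYZ', 'Movies'): 375, ('XYZ', 'Manufacturing'): 355,
    ('UVW', 'Sports'): 505, ('UVW', 'Robotics'): 345,
    ('UVW', 'Movies'): 56, ('UVW', 'Health'): 3263,
    ('UVW', 'Manufacturing'): 456, ('UVW', 'Others'): 524,
}
index = pd.MultiIndex.from_tuples(list(counts), names=['company_code', 'sector'])
grouped = pd.Series(list(counts.values()), index=index, name='X_sector')

# Sort all counts from largest to smallest, then keep the first three
# rows seen for each company_code.
top3 = grouped.sort_values(ascending=False).groupby(level='company_code').head(3)
print(top3)
```

The result stays ordered by count across all companies; appending .sort_index(level='company_code', sort_remaining=False) would be one way to group the rows by company again.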

Create a new pandas columns from multiple columns

Here is the dataframe
MatchId EventCodeId EventCode Team1 Team2 Team1_Goals Team2_Goals xG_Team1 xG_Team2 CurrentPlaytime
0 865314 1029 Goal Home Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 457040
1 865314 1029 Goal Home Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 1405394
2 865314 2053 Goal Away Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 1898705
3 865314 2053 Goal Away Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 4388278
4 865314 1029 Goal Home Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 4507898
5 865314 1030 Cancel Goal Home Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 4517728
6 865314 1029 Goal Home Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 4956346
7 865314 1030 Cancel Goal Home Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 4960633
8 865316 2053 Goal Away Coventry Bradford 0 0 1.0847662440468118 1.2526705617472387 447858
9 865316 2054 Cancel Goal Away Coventry Bradford 0 0 1.0847662440468118 1.2526705617472387 456361
The new columns will be created as follows:
for EventCodeId = 1029 and EventCode = Goal Home
new_col1 = CurrentPlaytime/(3*10**4)
for EventCodeId = 2053 and EventCode = Goal Away
new_col2 = CurrentPlaytime/(3*10**4)
For every other EventCodeId and EventCode, new_col1 and new_col2 will take 0.
Here is how I started, but I couldn't get any further. Please help.
new_col1 = []
new_col2 = []
def timeslot(EventCodeId, EventCode, CurrentPlaytime):
    if x == 1029 and y == 'Goal Home':
        new.Col1.append(z/(3*10**4))
    elif x == 2053 and y == 'Goal Away':
        new_col2.append(z/(3*10**4))
    else:
        new_col1.append(0)
        new_col2.append(0)
    return new_col1
    return new_col2
df1['new_col1', 'new_col2'] = df1.apply(lambda x,y,z: timeslot(x['EventCodeId'], y['EventCode'], z['CurrentPlaytime']), axis=1)
TypeError: ("<lambda>() missing 2 required positional arguments: 'y' and 'z'", 'occurred at index 0')
You do not need an explicit loop. Use vectorised operations where possible.
Using numpy.where:
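import numpy as np
# parentheses matter: /3*10**4 would divide by 3 and then multiply by 10**4
s = df1['CurrentPlaytime'] / (3 * 10**4)
mask1 = (df1['EventCodeId'] == 1029) & (df1['EventCode'] == 'Goal Home')
mask2 = (df1['EventCodeId'] == 2053) & (df1['EventCode'] == 'Goal Away')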
df1['new_col1'] = np.where(mask1, s, 0)
df1['new_col2'] = np.where(mask2, s, 0)
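For completeness, the TypeError in the question comes from the lambda: with axis=1, apply() passes a single argument (the row), not three. Below is a hedged sketch of an apply-based version that works, using a cut-down df1 with only the three relevant columns. The rewritten timeslot takes the whole row and returns a pair instead of appending to global lists.

```python
import pandas as pd

df1 = pd.DataFrame({
    'EventCodeId': [1029, 2053, 1030, 2054],
    'EventCode': ['Goal Home', 'Goal Away', 'Cancel Goal Home', 'Cancel Goal Away'],
    'CurrentPlaytime': [457040, 1898705, 4517728, 456361],
})

def timeslot(row):
    """Return (new_col1, new_col2) for one event row."""
    slot = row['CurrentPlaytime'] / (3 * 10**4)
    if row['EventCodeId'] == 1029 and row['EventCode'] == 'Goal Home':
        return slot, 0.0
    if row['EventCodeId'] == 2053 and row['EventCode'] == 'Goal Away':
        return 0.0, slot
    return 0.0, 0.0

# The function receives one row at a time; 'expand' splits the returned
# pair into two columns.
df1[['new_col1', 'new_col2']] = df1.apply(timeslot, axis=1, result_type='expand')
print(df1)
```

The np.where version remains the better choice for large frames, since it avoids a Python-level call per row.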

Merge two pandas dataframes to create a new dataframe with a specific operation

I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the 2nd dataframe gains new columns with the count of ethnicities for each company, such as American - 2, Mexican - 5 and so on, so that later on I can calculate a diversity score.
The columns in the output dataframe would look like:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get the counts per group with groupby, size and unstack, then join the result to the second DataFrame:
df1 = pd.DataFrame({'Company Name':list('aabcac'),
'Ethnicity':['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative
#df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
EDIT: You need to replace each unit suffix with the matching number of zeros and convert to float:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0'*6, 'B': '0'*9}
df['a'] = df['sale'].replace(d, regex=True).astype(float).sort_values(ascending=False)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
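Padding with zeros only works for whole numbers, and the sample data in the question contains values such as 5.2B and a lowercase 146m. One way to handle those is to split off the suffix and multiply; this is a sketch under the assumption that every value ends in a single M or B (in either case). The names sale, multipliers and value are illustrative.

```python
import pandas as pd

sale = pd.Series(['5.2B', '544M', '146m', '3.49B'])

# Multiplier for each unit suffix; suffixes are upper-cased before lookup.
multipliers = {'M': 1e6, 'B': 1e9}

number = sale.str[:-1].astype(float)            # everything but the last character
unit = sale.str[-1].str.upper().map(multipliers)  # last character -> multiplier
value = number * unit
print(value)
```

A value with an unknown suffix would map to NaN here, which makes bad inputs easy to spot.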
