I have a dataset containing Premier League football data, like this:
HomeTeam AwayTeam FTHG FTAG
0 Liverpool Norwich 4 1
1 West Ham Man City 0 5
2 Bournemouth Sheffield United 1 1
3 Burnley Southampton 3 0
... ... ... ... ...
where "FTHG" and "FTAG" are full-time home team goals and away team goals.
I need to write a function that calculates the final Premier League table given the results (in the form of a data frame). What I wrote is this function:
def calcScore(row):
    if PL_df.iloc[row]['FTHG'] > PL_df.iloc[row]['FTAG']:
        x = 3
        y = 0
    elif PL_df.iloc[row]['FTHG'] < PL_df.iloc[row]['FTAG']:
        x = 0
        y = 3
    elif PL_df.iloc[row]['FTHG'] == PL_df.iloc[row]['FTAG']:
        x = 1
        y = 1
    return x, y
This works; for example, for the first row it gives this output:
in[1]: calcScore(0)
out[1]: (3,0)
Now I need to create two columns, HP and AP, that contain the number of points awarded to the home and away teams respectively, using apply(). But I can't think of a way to do that.
I hope I was clear enough. Thank you in advance.
No need for a function (and also faster than apply):
import numpy as np

# Conditions are checked in order: home win, then draw; anything else is an away win.
win_or_draws = df['FTHG'] > df['FTAG'], df['FTHG'] == df['FTAG']
df['HP'] = np.select(win_or_draws, (3, 1), 0)
df['AP'] = np.select(win_or_draws, (0, 1), 3)
Output:
HomeTeam AwayTeam FTHG FTAG HP AP
0 Liverpool Norwich 4 1 3 0
1 West Ham Man City 0 5 0 3
2 Bournemouth Sheffield United 1 1 1 1
3 Burnley Southampton 3 0 3 0
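Since the end goal is the final Premier League table, here is a minimal sketch of aggregating the HP/AP columns into a points standing, assuming the frame with the new columns is called df as above:

# Points collected at home and away, summed per team.
home_pts = df.groupby('HomeTeam')['HP'].sum()
away_pts = df.groupby('AwayTeam')['AP'].sum()

# add(..., fill_value=0) keeps teams that only appear on one side of the fixtures.
table = home_pts.add(away_pts, fill_value=0).sort_values(ascending=False).rename('Points')
print(table)

A full table would also need goal difference and goals scored for tie-breaking, but the same groupby/sum pattern applies.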
I am working with a dataset of international football results (a text version is included further down).
I want to analyze the most effective teams from this dataset so I have decided to calculate points based on their results and then calculate points per game. For reference, a win is 3 points, a draw is 1 point and a loss is 0 points. So to calculate points I decided to add two new columns that say how many points the home team and away team got. I did this by:
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] > 0, 'points_home_team'] = 3
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] == 0, 'points_home_team'] = 1
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] < 0, 'points_home_team'] = 0
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] > 0, 'points_away_team'] = 0
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] == 0, 'points_away_team'] = 1
since_2018.loc[since_2018['home_score'] - since_2018['away_score'] < 0, 'points_away_team'] = 3
This gives me a SettingWithCopyWarning, but the code seems to work fine; the resulting columns can be seen in the text version of the dataset below.
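As an aside, the SettingWithCopyWarning usually means since_2018 was itself created by slicing another DataFrame. A minimal sketch of the usual fix, assuming a hypothetical parent frame called all_results with a date index:

# Hypothetical parent frame: taking an explicit copy makes since_2018 an
# independent DataFrame, so the .loc assignments above no longer warn.
since_2018 = all_results[all_results.index >= '2018-01-01'].copy()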
To get total points I did this:
home_points = (since_2018.groupby(['home_team'])['points_home_team'].sum() + since_2018.groupby(['away_team'])['points_away_team'].sum())
home_points.sort_values(ascending=False)
Now I want to calculate points per game for each team in order to see which teams have been the most effective and I think I managed to get a number for games played by each team through this code:
matches = since_2018.groupby("home_team").count() + since_2018.groupby('away_team').count()
So from there I am stuck on how to actually use those numbers of games and points to get points per game. Any help is appreciated, thanks!
For reference, here is a text version of the dataset:
home_team away_team home_score away_score tournament city country neutral total_goals points_home_team points_away_team
date
2018-01-02 Iraq United Arab Emirates 0 0 Gulf Cup Kuwait City Kuwait True 0 1.0 1.0
2018-01-02 Oman Bahrain 1 0 Gulf Cup Kuwait City Kuwait True 1 3.0 0.0
2018-01-05 Oman United Arab Emirates 0 0 Gulf Cup Kuwait City Kuwait True 0 1.0 1.0
2018-01-07 Estonia Sweden 1 1 Friendly Abu Dhabi United Arab Emirates True 2 1.0 1.0
2018-01-11 Denmark Sweden 0 1 Friendly Abu Dhabi United Arab Emirates True 1 0.0 3.0
Try this to set the home/away points:
net_goals = since_2018.home_score - since_2018.away_score
home_points = [3 if ng>0 else 1 if ng==0 else 0 for ng in net_goals]
And the inverse for away_points
Now change
matches = since_2018.groupby("home_team").count() + since_2018.groupby('away_team').count()
To
... .groupby('home_team').home_points.agg(['sum', 'count']) ...
and you are almost done
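A minimal sketch of finishing that suggestion, assuming the home_points/away_points lists above have been assigned back onto since_2018 as columns:

# Points and match counts per team, once as the home side and once as the away side.
home = since_2018.groupby('home_team')['home_points'].agg(['sum', 'count'])
away = since_2018.groupby('away_team')['away_points'].agg(['sum', 'count'])

# add(..., fill_value=0) covers teams that only ever appear on one side,
# then points per game is total points divided by total matches.
totals = home.add(away, fill_value=0)
totals['points_per_game'] = totals['sum'] / totals['count']
print(totals.sort_values('points_per_game', ascending=False))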
Create a dedicated total points data frame and add a column for the number of matches to it. Combine the home and away teams to calculate the points per game.
# Points scored and matches played per team, as the home side and as the away side.
df_h = pd.concat([since_2018.groupby('home_team')['points_home_team'].sum(),
                  since_2018.groupby('home_team').size()], axis=1)
df_h.columns = ['points', 'matches']
df_a = pd.concat([since_2018.groupby('away_team')['points_away_team'].sum(),
                  since_2018.groupby('away_team').size()], axis=1)
df_a.columns = ['points', 'matches']
result = pd.concat([df_h, df_a], axis=0)
result['rate'] = result['points'] / result['matches']
result
points matches rate
Denmark 0.0 1 0.0
Estonia 1.0 1 1.0
Iraq 1.0 1 1.0
Oman 4.0 2 2.0
Bahrain 0.0 1 0.0
Sweden 4.0 2 2.0
United Arab Emirates 2.0 2 1.0
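Note that in this small sample every team happens to appear on only one side; on a larger dataset a team can end up in both df_h and df_a, so it is safer to sum the duplicated index entries before computing the rate. A sketch:

# Collapse teams that occur in both the home and the away frame,
# then recompute points per game on the combined totals.
result = pd.concat([df_h, df_a]).groupby(level=0).sum()
result['rate'] = result['points'] / result['matches']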
I have an empty matrix and I want to replace the matrix elements with 1 if country (index) belongs to Region (column).
I tried to create a double loop, but I got stuck when I needed to write the conditional (the data is [152 rows x 6 columns]). Thanks so much.
west europe east europe latin america
Norway 0 0 0
Denmark 0 0 0
Iceland 0 0 0
Switzerland 0 0 0
Finland 0 0 0
Netherlands 0 0 0
Sweden 0 0 0
Austria 0 0 0
Ireland 0 0 0
Germany 0 0 0
Belgium 0 0 0
I was thinking of something like:

matrix = pd.DataFrame(np.random.randint(1, size=(152, 6)),
                      index=[...],  # enumerate all the countries
                      columns=['west europe', 'east europe', 'latin america',
                               'north america', 'africa', 'asia'])
print(matrix)

for i in range(len(matrix)):
    for j in range(len(matrix.columns)):
        if ...:   # e.g. the row's Region is 'Africa' and column j is 'africa'
            matrix.iloc[i, j] = 1
        elif ...:  # ...and so on for the other regions
            matrix.iloc[i, j] = 1
        else:
            matrix.iloc[i, j] = 0
print(matrix)
Sample data frame with countries and region:
Country Happiness Rank Happiness Score Economy Family Health Freedom Generosity Corruption Dystopia Job Satisfaction Region
0 Norway 1 7.537 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027 94.6 Western Europe
1 Denmark 2 7.522 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707 93.5 Western Europe
2 Iceland 3 7.504 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715 94.5 Western Europe
3 Switzerland 4 7.494 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716 93.7 Western Europe
4 Finland 5 7.469 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182 91.2 Western Europe
5 Netherlands 6 7.377 1.503945 1.428939 0.810696 0.585384 0.470490 0.282662 2.294804 93.8 Western Europe
If your input variable data is a DataFrame then, as @Alollz mentioned, you can use the pandas pd.get_dummies function.
Something like this: pd.get_dummies(data, columns=['Region'])
And the output would look like:
Country HappinessRank HappinessScore Economy Family Health Freedom Generosity Corruption Dystopia JobSatisfaction Region_WesternEurope
0 Norway 1 7.537 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027 94.6 1
1 Denmark 2 7.522 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707 93.5 1
2 Iceland 3 7.504 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715 94.5 1
3 Switzerland 4 7.494 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716 93.7 1
4 Finland 5 7.469 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182 91.2 1
5 Netherlands 6 7.377 1.503945 1.428939 0.810696 0.585384 0.470490 0.282662 2.294804 93.8 1
It will take the Region category column and make it into indicator columns. In this case it uses the column name as the prefix but you can play around with that.
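If the goal is exactly the country-by-region 0/1 matrix from the question, a minimal sketch, assuming the sample frame is called data with the Country and Region columns shown above:

import pandas as pd

# One row per country, one indicator column per region value;
# astype(int) keeps 0/1 even where newer pandas returns booleans.
matrix = pd.get_dummies(data.set_index('Country')['Region']).astype(int)
print(matrix.head())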
Here is the dataframe
MatchId EventCodeId EventCode Team1 Team2 Team1_Goals Team2_Goals xG_Team1 xG_Team2 CurrentPlaytime
0 865314 1029 Goal Home Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 457040
1 865314 1029 Goal Home Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 1405394
2 865314 2053 Goal Away Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 1898705
3 865314 2053 Goal Away Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 4388278
4 865314 1029 Goal Home Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 4507898
5 865314 1030 Cancel Goal Home Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 4517728
6 865314 1029 Goal Home Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 4956346
7 865314 1030 Cancel Goal Home Northampton Crawley Town 2 2 2.067663207769023 0.8130662505484256 4960633
8 865316 2053 Goal Away Coventry Bradford 0 0 1.0847662440468118 1.2526705617472387 447858
9 865316 2054 Cancel Goal Away Coventry Bradford 0 0 1.0847662440468118 1.2526705617472387 456361
The new columns will be created as follows:
for EventCodeId = 1029 and EventCode = Goal Home:
new_col1 = CurrentPlaytime/3*10**4
for EventCodeId = 2053 and EventCode = Goal Away:
new_col2 = CurrentPlaytime/3*10**4
For every other EventCodeId and EventCode, new_col1 and new_col2 will be 0.
Here is how I started, but I couldn't get any further. Please help:
new_col1 = []
new_col2 = []
def timeslot(EventCodeId, EventCode, CurrentPlaytime):
    if EventCodeId == 1029 and EventCode == 'Goal Home':
        new_col1.append(CurrentPlaytime / (3 * 10**4))
    elif EventCodeId == 2053 and EventCode == 'Goal Away':
        new_col2.append(CurrentPlaytime / (3 * 10**4))
    else:
        new_col1.append(0)
        new_col2.append(0)
    return new_col1
    return new_col2
df1['new_col1', 'new_col2'] = df1.apply(lambda x,y,z: timeslot(x['EventCodeId'], y['EventCode'], z['CurrentPlaytime']), axis=1)
TypeError: ("<lambda>() missing 2 required positional arguments: 'y' and 'z'", 'occurred at index 0')
You do not need an explicit loop. Use vectorised operations where possible.
Using numpy.where:
import numpy as np

s = df1['CurrentPlaytime'] / 3 * 10**4  # or / (3*10**4), as in the question's own function
mask1 = (df1['EventCodeId'] == 1029) & (df1['EventCode'] == 'Goal Home')
mask2 = (df1['EventCodeId'] == 2053) & (df1['EventCode'] == 'Goal Away')
df1['new_col1'] = np.where(mask1, s, 0)
df1['new_col2'] = np.where(mask2, s, 0)
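For completeness, if you do want the row-wise apply route from the question, a minimal sketch: with axis=1 the function receives the whole row as a single Series, so it takes one argument and can return both values at once.

import pandas as pd

def timeslot(row):
    # Divisor as in the question's own function; adjust if a different scale is intended.
    if row['EventCodeId'] == 1029 and row['EventCode'] == 'Goal Home':
        return pd.Series({'new_col1': row['CurrentPlaytime'] / (3 * 10**4), 'new_col2': 0})
    if row['EventCodeId'] == 2053 and row['EventCode'] == 'Goal Away':
        return pd.Series({'new_col1': 0, 'new_col2': row['CurrentPlaytime'] / (3 * 10**4)})
    return pd.Series({'new_col1': 0, 'new_col2': 0})

df1[['new_col1', 'new_col2']] = df1.apply(timeslot, axis=1)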
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the second dataframe gets extra columns with the count of each ethnicity per company, such as American: 2, Mexican: 5, and so on, so that later on I can calculate a diversity score.
The output dataframe would look like this:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get counts per group using groupby with size, reshape with unstack, and finally join to the second DataFrame:
df1 = pd.DataFrame({'Company Name':list('aabcac'),
'Ethnicity':['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative
#df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
EDIT: You need to replace the unit suffix with zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
# Each unit letter stands for a fixed number of trailing zeros.
d = {'M': '0'*6, 'B': '0'*9}
df['a'] = df['sale'].replace(d, regex=True).astype(float)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
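One caveat: the zero-padding trick assumes whole-number prefixes such as 100M; a value like 5.2B from the original data would not scale correctly. A sketch of a multiplier-based conversion that also handles decimals and sorts the frame afterwards:

import pandas as pd

# Hypothetical example mixing decimal and integer prefixes.
df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'sale': ['5.2B', '544M', '40M']})

# Split the numeric part from the unit letter, then scale by a multiplier map.
parts = df['sale'].str.extract(r'(?P<num>[\d.]+)(?P<unit>[MB])')
df['a'] = parts['num'].astype(float) * parts['unit'].map({'M': 1e6, 'B': 1e9})
df = df.sort_values('a', ascending=False)
print(df)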
I have a dataframe topic_data that contains the output of an LDA topic model:
topic_data.head(15)
topic word score
0 0 Automobile 0.063986
1 0 Vehicle 0.017457
2 0 Horsepower 0.015675
3 0 Engine 0.014857
4 0 Bicycle 0.013919
5 1 Sport 0.032938
6 1 Association_football 0.025324
7 1 Basketball 0.020949
8 1 Baseball 0.016935
9 1 National_Football_League 0.016597
10 2 Japan 0.051454
11 2 Beer 0.032839
12 2 Alcohol 0.027909
13 2 Drink 0.019494
14 2 Vodka 0.017908
This shows the top 5 terms for each topic, and the score (weight) for each. What I'm trying to do is reformat so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings generated from the word and score columns (something along the lines of "%s (%.02f)" % (word,score)). That means the new dataframe should look something like this:
Topic 0 1 ...
Rank
0 Automobile (0.06) Sport (0.03) ...
1 Vehicle (0.017) Association_football (0.03) ...
... ... ... ...
What's the right way of going about this? I assume it involves a combination of index-setting, unstacking, and ranking, but I'm not sure of the right approach.
It would be something like this, note that Rank has to be generated first:
In [140]:
df['Rank'] = (-1*df).groupby('topic').score.transform(np.argsort)
df['New_str'] = df.word + df.score.apply(' ({0:.2f})'.format)
df2 = df.sort(['Rank', 'score'])[['New_str', 'topic','Rank']]
print df2.pivot(index='Rank', values='New_str', columns='topic')
topic 0 1 2
Rank
0 Automobile (0.06) Sport (0.03) Japan (0.05)
1 Vehicle (0.02) Association_football (0.03) Beer (0.03)
2 Horsepower (0.02) Basketball (0.02) Alcohol (0.03)
3 Engine (0.01) Baseball (0.02) Drink (0.02)
4 Bicycle (0.01) National_Football_League (0.02) Vodka (0.02)
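On current pandas, DataFrame.sort and the Python 2 print statement used above no longer work; a sketch of the same reshape with groupby().rank and pivot, assuming topic_data as in the question:

# Rank within each topic by descending score (0 = top term).
topic_data['Rank'] = (topic_data.groupby('topic')['score']
                                .rank(method='first', ascending=False)
                                .astype(int) - 1)

# Build the formatted label, then pivot to a Rank x topic table.
topic_data['label'] = topic_data['word'] + topic_data['score'].map(' ({:.2f})'.format)
print(topic_data.pivot(index='Rank', columns='topic', values='label'))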