I have a pandas DataFrame which looks like this:
home_team away_team home_score away_score
Spain Albania 0 5
Albania Spain 4 1
Albania Portugal 1 2
Albania US 0 2
From the first two rows we see that Spain and Albania played 2 times in total; Spain scored 1 goal and Albania scored 9 goals.
Then Albania has 1 game each against the US and Portugal, with their scores. I am trying to answer 'How many goals has Albania scored against each country, and how many goals has that country scored against Albania?'
So that I would get a DataFrame like this:
Albania Spain 9 1
Albania Portugal 1 2
Albania US 0 2
When I use print(df.groupby(['away_team']).sum() + df.groupby(['home_team']).sum()) I do not get what I want: for some reason some lines are filled with NaNs, and the sums do not add up correctly.
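For reference, the sample data above as a constructible DataFrame (a minimal reproduction):
import pandas as pd

df = pd.DataFrame({
    'home_team': ['Spain', 'Albania', 'Albania', 'Albania'],
    'away_team': ['Albania', 'Spain', 'Portugal', 'US'],
    'home_score': [0, 4, 1, 0],
    'away_score': [5, 1, 2, 2]})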
You can sort both team columns and assign them back, then swap the score values in the rows where the original home_team no longer matches the sorted one, and finally aggregate by sum:
import numpy as np

# sort each home/away pair alphabetically, remembering the original home team
orig = df['home_team'].copy()
df[['home_team','away_team']] = np.sort(df[['home_team','away_team']], axis=1)
# where the home team changed, swap the scores as well
m = orig.ne(df['home_team'])
df.loc[m, ['home_score','away_score']] = df.loc[m, ['away_score','home_score']].values
print (df)
home_team away_team home_score away_score
0 Albania Spain 5 0
1 Albania Spain 4 1
2 Albania Portugal 1 2
3 Albania US 0 2
df1 = df.groupby(['home_team', 'away_team'], as_index=False).sum()
print (df1)
home_team away_team home_score away_score
0 Albania Portugal 1 2
1 Albania Spain 9 1
2 Albania US 0 2
Sort the home & away team columns alphabetically to generate two new team columns.
Add two more columns for score_for & score_against.
Group by the two new team columns and sum the two new score columns.
df[['team1', 'team2']] = df[['home_team', 'away_team']].apply(np.sort, axis=1, result_type='expand')
df[['score_for', 'score_against']] = df.apply(
    lambda x: [x.home_score, x.away_score] if x.team1 == x.home_team else [x.away_score, x.home_score],
    axis=1,
    result_type='expand')
df.groupby(['team1', 'team2'])[['score_for', 'score_against']].sum()
score_for score_against
team1 team2
Albania Portugal 1 2
Spain 9 1
US 0 2
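If you prefer a flat DataFrame like the one asked for, a small follow-up: reset_index() turns the grouped keys back into ordinary columns:
df.groupby(['team1', 'team2'])[['score_for', 'score_against']].sum().reset_index()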
Swap the home team and away team, as well as the scores. Concatenate the two dataframes together and groupby.
df1 = df.T.reset_index(drop=True)
df2 = df1.rename({0:1, 1:0, 2:3, 3:2}).sort_index()
pd.concat([df1.T, df2.T]).groupby([0,1]).sum().loc[['Albania']]
home = df[df.home_team == "Albania"]
home.columns = ["country","opponent","win","lost"]
away = df[df.away_team == "Albania"]
away.columns = ["opponent","country","lost","win"]
pd.concat([home,away],ignore_index=True).groupby(["country","opponent"]).sum()
I have a dataframe (results) of EPL results from the past 28 years, and I am trying to calculate the average home team points (HPts) from their previous 5 home games within the current season. The rows are already in chronological order. What I am effectively looking for is a version of the starter code below that partitions by HomeTeam and Season and calculates the mean of HPts using a window of the previous 5 rows with a matching HomeTeam and Season. Clearly the existing code as written does not do what I need (it looks only at the last 5 rows regardless of team and season); it is just there to show what I mean as a starting point.
HomeTeam AwayTeam Season Result HPts APts
0 Arsenal Coventry 1993 A 0 3
1 Aston Villa QPR 1993 H 3 0
2 Chelsea Blackburn 1993 A 0 3
3 Liverpool Sheffield Weds 1993 H 3 0
4 Man City Leeds 1993 D 1 1
.. ... ... ... ... ... ...
375 Liverpool Crystal Palace 2020 H 3 0
376 Man City Everton 2020 H 3 0
377 Sheffield United Burnley 2020 H 3 0
378 West Ham Southampton 2020 H 3 0
379 Wolves Man United 2020 A 0 3
[10804 rows x 6 columns]
# Starting point for my code for home team avg points from last 5 home games
results['HomeLast5'] = results['HPts'].rolling(5).mean()
Anyone know how I can add a new column with the rolling average points for a given team and season? I could probably figure out a way of doing this with a loop, but I'm sure that's not going to be the most efficient way to solve this problem.
Group the dataframe by HomeTeam and Season, then calculate the rolling mean on HPts. Then, in order to assign the calculated mean back to the original dataframe, drop levels 0 and 1 from the index so that index alignment works properly.
g = results.groupby(['HomeTeam', 'Season'])['HPts']
results['HomeLast5'] = g.rolling(5).mean().droplevel([0, 1])
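Note that rolling(5) includes the current match in the window. If "previous 5" should exclude the current game (so the feature is known before kickoff), one sketch, using the same droplevel trick, is to shift within each group first (prev_HPts is a hypothetical helper column):
# shift(1) within each group drops the current game, so the window covers the 5 prior home games
results['prev_HPts'] = results.groupby(['HomeTeam', 'Season'])['HPts'].shift(1)
results['HomeLast5'] = (results.groupby(['HomeTeam', 'Season'])['prev_HPts']
                               .rolling(5).mean().droplevel([0, 1]))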
I have a csv which I read with pandas:
The data looks like this:
home_team away_team home_score away_score
Scotland England 0 0
England Scotland 4 2
Scotland England 2 1
... ... ... ...
I want to write a function that would take two parameters - two teams.
And it would output how many times the game was won by team1, how many by team2, and how many draws there were.
I've tried comparing the scores, but I'm not sure how to handle the fact that the same team might appear in either the home or the away column.
def who_won(team1, team2):
    home = data['home_team']
    away = data['away_team']
    home_score = data['home_score']
    away_score = data['away_score']
    counter_won = 0
    counter_lost = 0
    counter_draw = 0
    for item in range(len(data)):
        if home_score[item] > away_score[item]:
            counter_won = counter_won + 1
        elif home_score[item] < away_score[item]:
            counter_lost = counter_lost + 1
        else:
            counter_draw = counter_draw + 1
But I am not sure how I should proceed with comparing the games and counting how many times each team has won, lost, or drawn.
Desired output would be
England won 1 time versus Scotland
Scotland won 1 time versus England
Scotland and England had one draw
You could do some pre-processing on your data and then use the groupby method of the pandas DataFrame to get the output you want.
1) Pre-Processing
Add two columns: one that holds a tuple of the (home, away) teams, which I call match, and one to show the match result.
df['match'] = list(zip(df.home_team, df.away_team))
To get the match result, you will need a function:
def match_result(row):
    if row.home_score > row.away_score:
        return row.home_team + ' won'
    elif row.home_score < row.away_score:
        return row.away_team + ' won'
    else:
        return 'draw'
df['result'] = df.apply(match_result, axis=1)
2) Group by
Then you filter the dataset to include only those matches that are between the two input teams. Finally, you group the data by result and count the number of each possible outcome:
df.loc[df.match.isin([(team1, team2), (team2, team1)]), 'result'].groupby(df.result).count()
Test
home_team away_team home_score away_score result \
0 Scotland England 0 0 draw
1 England Scotland 4 2 England won
2 Scotland England 2 1 Scotland won
match
0 (Scotland, England)
1 (England, Scotland)
2 (Scotland, England)
result
England won 1
Scotland won 1
draw 1
Name: result, dtype: int64
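To match the interface asked for in the question, the pieces above can be wrapped into a function, as a minimal sketch (assuming df already has the match and result columns added above):
def who_won(team1, team2):
    # keep only games between the two teams, in either home/away order
    mask = df['match'].isin([(team1, team2), (team2, team1)])
    return df.loc[mask, 'result'].value_counts()

print(who_won('England', 'Scotland'))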
Actually, a breakdown by home/away pair as-is would be easier to achieve:
df['won'] = np.sign(df['home_score']-df['away_score'])
df.groupby(['home_team','away_team'])['won'].value_counts()
Output:
home_team away_team won
England Scotland 1 1
Scotland England 0 1
1 1
Name: won, dtype: int64
In your case, it's a little trickier:
# home team won/lost/tied
df['won'] = np.sign(df['home_score']-df['away_score'])
# we don't care about home/away, so we sort the pair by name
# but we need to revert the result first:
df['won'] = np.where(df['home_team'].lt(df['away_team']),
                     df['won'], -df['won'])
# sort the pair home/away
df[['home_team','away_team']] = np.sort(df[['home_team','away_team']], axis=1)
# value counts:
df.groupby(['home_team','away_team'])['won'].value_counts()
Output:
home_team away_team won
England Scotland -1 1
0 1
1 1
Name: won, dtype: int64
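To read the result off for a specific pair, you can index the counts; after the sort, won is from the perspective of the alphabetically first team, so here 1 means England won, -1 means Scotland won, and 0 is a draw (a small usage note):
counts = df.groupby(['home_team', 'away_team'])['won'].value_counts()
print(counts.loc[('England', 'Scotland')])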
My solution takes into account the following details:
Both teams (team1 and team2) can be either home or away, but you want to know how many times team1 won / lost / tied against team2.
The source DataFrame also contains matches with other teams, or where both the home and away teams are "other" (different from the 2 we are interested in).
To get the result, define your function as follows:
def who_won(team1, team2):
    df1 = df.query('home_team == @team1 and away_team == @team2')\
        .set_axis(['tm1', 'tm2', 's1', 's2'], axis=1, inplace=False)
    df2 = df.query('home_team == @team2 and away_team == @team1')\
        .set_axis(['tm2', 'tm1', 's2', 's1'], axis=1, inplace=False)
    df3 = pd.concat([df1, df2], sort=False).reset_index(drop=True)
    dif = df3.s1 - df3.s2
    bins = pd.cut(dif, bins=[-100, -1, 0, 100], labels=['lost', 'draw', 'won'])
    return dif.groupby(bins).count()
Note a clever trick: I "swap" the home and away teams when team2 was the home team (df2).
Then I concatenate df1 and df2, so that team1 is always in the tm1 column.
So now df3.s1 - df3.s2 is the difference between the goals by team1 and the goals by team2 (note that other solutions failed to recognize this distinction).
Then, calling cut introduces proper categorical names (lost / draw / won), providing intuitive access to each component of the final result.
To test this function, I took a slightly bigger DataFrame, also including other teams:
home_team away_team home_score away_score
0 Scotland England 0 0
1 England Scotland 4 2
2 England Scotland 3 1
3 Scotland England 2 1
4 Scotland Wales 3 1
5 Wales Scotland 2 1
Then I called who_won('England', 'Scotland') getting the result:
lost 1
draw 1
won 2
dtype: int64
The result is a Series with CategoricalIndex (lost / draw / won).
If you want to reformat this result to your desired output and get each "component", it is easy. E.g. to get the number of matches that England won against Scotland, run res['won'].
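To produce the exact sentences from the question, a small sketch using the Series returned by who_won above:
res = who_won('England', 'Scotland')
print(f"England won {res['won']} time(s) versus Scotland")
print(f"Scotland won {res['lost']} time(s) versus England")
print(f"England and Scotland had {res['draw']} draw(s)")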
I have a multi-index dataset like this:
mean std
Happiness Score Happiness Score
Region
Australia and New Zealand 7.302500 0.020936
Central and Eastern Europe 5.371184 0.578274
Eastern Asia 5.632333 0.502100
Latin America and Caribbean 6.069074 0.728157
Middle East and Northern Africa 5.387879 1.031656
North America 7.227167 0.179331
Southeastern Asia 5.364077 0.882637
Southern Asia 4.590857 0.535978
Sub-Saharan Africa 4.150957 0.584945
Western Europe 6.693000 0.777886
I would like to sort it by standard deviation.
My attempt:
import numpy as np
import pandas as pd
df1.sort_values(by=('Region','std'))
How can I fix this?
Setup
np.random.seed(0)
df = pd.DataFrame(np.random.choice(10, (5, 2)))
df.columns = pd.MultiIndex.from_arrays([['mean', 'std'], ['Happiness Score'] * 2])
df
mean std
Happiness Score Happiness Score
0 5 0
1 3 3
2 7 9
3 3 5
4 2 4
You can use argsort and reindex df:
df.loc[:, ('std', 'Happiness Score')].argsort().values
# array([0, 1, 4, 3, 2])
df.iloc[df.loc[:, ('std', 'Happiness Score')].argsort().values]
# df.iloc[np.argsort(df.loc[:, ('std', 'Happiness Score')])]
mean std
Happiness Score Happiness Score
0 5 0
1 3 3
4 2 4
3 3 5
2 7 9
Another solution is sort_values, passing a tuple:
df.sort_values(by=('std', 'Happiness Score'), axis=0)
mean std
Happiness Score Happiness Score
0 5 0
1 3 3
4 2 4
3 3 5
2 7 9
I think you had the right idea, but the ordering within the tuple was incorrect.
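If you want the largest standard deviation first, the same call also accepts ascending=False:
df.sort_values(by=('std', 'Happiness Score'), ascending=False)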
How do I left join tables with a 1:n relationship, while keeping the number of rows the same as the left table and concatenating the duplicated data with a character/string like ';'?
Example: Country Table
CountryID Country Area
1 UK 1029
2 Russia 8374
Cities Table
CountryID City
1 London
1 Manchester
2 Moscow
2 Ufa
I want:
CountryID Country Area Cities
1 UK 1029 London;Manchester
2 Russia 8374 Moscow;Ufa
I know how to perform a normal left join
country.merge(city, how='left', on='CountryID')
which gives me four rows instead of two:
Area Country CountryID City
1029 UK 1 London
1029 UK 1 Manchester
8374 Russia 2 Moscow
8374 Russia 2 Ufa
Use map with a Series created by groupby + join to build the new column in df1 if performance is important:
df1['Cities'] = df1['CountryID'].map(df2.groupby('CountryID')['City'].apply(';'.join))
print (df1)
CountryID Country Area Cities
0 1 UK 1029 London;Manchester
1 2 Russia 8374 Moscow;Ufa
Detail:
print (df2.groupby('CountryID')['City'].apply(';'.join))
CountryID
1 London;Manchester
2 Moscow;Ufa
Name: City, dtype: object
Another solution with join:
df = df1.join(df2.groupby('CountryID')['City'].apply(';'.join), on='CountryID')
print (df)
CountryID Country Area City
0 1 UK 1029 London;Manchester
1 2 Russia 8374 Moscow;Ufa
This will give you the desired result:
df1.merge(df2, on='CountryID').groupby(['CountryID', 'Country', 'Area']).agg({'City': lambda x: ';'.join(x)}).reset_index()
# CountryID Country Area City
#0 1 UK 1029 London;Manchester
#1 2 Russia 8374 Moscow;Ufa
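One caveat: merge defaults to an inner join, so a country with no matching cities would drop out of this result, while the map/join approaches above keep it (with NaN). If you need to keep such rows here too, a sketch with a left merge:
(df1.merge(df2, on='CountryID', how='left')
    .groupby(['CountryID', 'Country', 'Area'])['City']
    .agg(lambda x: ';'.join(x.dropna()))   # drop missing cities before joining
    .reset_index())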
I know the question title is a little ambiguous.
My goal is to assign a global key column based on 2 columns plus a unique value in my data frame.
For example
CountryCode | Accident
AFG Car
AFG Bike
AFG Car
AFG Plane
USA Car
USA Bike
UK Car
Let Car = 01, Bike = 02, Plane = 03
My desired global key format is [Accident][CountryCode][UniqueValue].
The unique value is a running count of rows with the same [Accident][CountryCode],
so if Accident = Car and CountryCode = AFG and it is the first occurrence, the global key would be 01AFG01.
The desired dataframe would look like this:
CountryCode | Accident | GlobalKey
AFG Car 01AFG01
AFG Bike 02AFG01
AFG Car 01AFG02
AFG Plane 01AFG03
USA Car 01USA01
USA Bike 01USA02
UK Car 01UK01
I have tried running a for loop to append Accident Number and CountryCode together
for example:
globalKey = []
for x in range(0,6):
    string = df.iloc[x, 1]
    string2 = df.iloc[x, 2]
    if string2 == 'Car':
        number = '01'
    elif string2 == 'Bike':
        number = '02'
    elif string2 == 'Plane':
        number = '03'
    # Concat the number of the accident and the Country Code
    subKey = number + string
    # Append to the list
    globalKey.append(subKey)
This code gives me something like 01AFG, 02AFG based on the value I assign, but I also want to assign the unique value by counting occurrences of the same CountryCode and Accident combination.
I am stuck with the code above. I think there should be a better way to do it using the map function in pandas.
Thanks for helping, guys!
Much appreciated!
You can use cumcount to achieve this in a number of steps, like this:
In [1]: df = pd.DataFrame({'Country':['AFG','AFG','AFG','AFG','USA','USA','UK'], 'Accident':['Car','Bike','Car','Plane','Car','Bike','Car']})
In [2]: df
Out[2]:
Accident Country
0 Car AFG
1 Bike AFG
2 Car AFG
3 Plane AFG
4 Car USA
5 Bike USA
6 Car UK
## Create a column to keep incremental values for `Country`
In [3]: df['cumcount'] = df.groupby('Country').cumcount()
In [4]: df
Out[4]:
Accident Country cumcount
0 Car AFG 0
1 Bike AFG 1
2 Car AFG 2
3 Plane AFG 3
4 Car USA 0
5 Bike USA 1
6 Car UK 0
## Create a column to keep incremental values for combination of `Country`,`Accident`
In [5]: df['cumcount_type'] = df.groupby(['Country','Accident']).cumcount()
In [6]: df
Out[6]:
Accident Country cumcount cumcount_type
0 Car AFG 0 0
1 Bike AFG 1 0
2 Car AFG 2 1
3 Plane AFG 3 0
4 Car USA 0 0
5 Bike USA 1 0
6 Car UK 0 0
And from that point on you can concatenate the values of cumcount_type, Country and the accident code to achieve what you're after, as sketched below.
Maybe you want to add 1 to each of the values you have under the different counts, depending on whether you want to start counting from 0 or 1.
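For example, a sketch of that last step, reusing the Car/Bike/Plane codes from the question (accident_code is just a temporary helper here):
# map accident types to their 2-digit codes, then build the key from the pieces above
accident_code = df['Accident'].map({'Car': '01', 'Bike': '02', 'Plane': '03'})
df['GlobalKey'] = accident_code + df['Country'] + (df['cumcount_type'] + 1).astype(str).str.zfill(2)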
I hope this helps.
First of all, don't use for loops if you can help it. For example, you can do your Accident to code mapping with:
df['AccidentCode'] = df['Accident'].map({'Car': '01', 'Bike': '02', 'Plane': '03'})
To get the unique code, Thanos has shown how to do that using GroupBy.cumcount:
df['CA_ID'] = df.groupby(['CountryCode', 'Accident']).cumcount() + 1
And then to put them all together into a unique key:
df['NewKey'] = df['AccidentCode'] + df['CountryCode'] + df['CA_ID'].map('{:0>2}'.format)
which gives:
CountryCode Accident GlobalKey AccidentCode CA_ID NewKey
0 AFG Car 01AFG01 01 1 01AFG01
1 AFG Bike 02AFG01 02 1 02AFG01
2 AFG Car 01AFG02 01 2 01AFG02
3 AFG Plane 01AFG03 03 1 03AFG01
4 USA Car 01USA01 01 1 01USA01
5 USA Bike 01USA02 02 1 02USA01
6 UK Car 01UK01 01 1 01UK01
After you create your subKey, we can sort the dataframe and count the occurrences of the pairs. First let's reset the index (to store the original order):
df = df.reset_index()
then sort by the subKey and count
df = df.sort_values(by='subKey').reset_index(drop=True)
df['newnumber'] = 1
for ind in range(1, len(df)):  # start at 1 because the first row is always 1
    if df.loc[ind, 'subKey'] == df.loc[ind - 1, 'subKey']:
        df.loc[ind, 'newnumber'] = df.loc[ind - 1, 'newnumber'] + 1
Finally create the GlobalKey with the help of the zfill function, then reorder by the original index:
df['GlobalKey'] = df.apply(lambda x: x['subKey'] + str(x['newnumber']).zfill(2), 1)
df = df.sort_values(by='index').drop(columns='index').reset_index(drop=True)
I don't have any experience with Pandas, so this answer may not be what you are looking for. That being said, if the data you have is really that simple (few countries, few accident types), have you considered keeping a counter for each country|accident combination?
So as you traverse your input, just increment the counter for that country|accident combination, and then read through those counters at the end to produce the GlobalKeys.
If you have other data to store besides the GlobalKey, then store the country|accident combinations as lists, and read through them at the end one at a time to produce the GlobalKeys.
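For example, a minimal sketch of that idea with plain Python counters (column names and the Car/Bike/Plane codes as given in the question):
from collections import defaultdict

codes = {'Car': '01', 'Bike': '02', 'Plane': '03'}   # accident codes from the question
counters = defaultdict(int)                          # one running count per (country, accident) pair
global_keys = []
for country, accident in zip(df['CountryCode'], df['Accident']):
    counters[(country, accident)] += 1
    global_keys.append(codes[accident] + country + str(counters[(country, accident)]).zfill(2))
df['GlobalKey'] = global_keys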