Rolling average with filters

I have a dataframe (results) of EPL results from the past 28 years, and I am trying to calculate the average home team points (HPts) from their previous 5 home games within the current season. The rows are already in chronological order. What I am effectively looking for is a version of the starter code below that partitions by HomeTeam and Season and calculates the mean of HPts over a window of the previous 5 rows with matching HomeTeam and Season. Clearly the existing code as written does not do what I need (it looks only at the last 5 rows regardless of team and season); it is just there to show what I mean as a starting point.
HomeTeam AwayTeam Season Result HPts APts
0 Arsenal Coventry 1993 A 0 3
1 Aston Villa QPR 1993 H 3 0
2 Chelsea Blackburn 1993 A 0 3
3 Liverpool Sheffield Weds 1993 H 3 0
4 Man City Leeds 1993 D 1 1
.. ... ... ... ... ... ...
375 Liverpool Crystal Palace 2020 H 3 0
376 Man City Everton 2020 H 3 0
377 Sheffield United Burnley 2020 H 3 0
378 West Ham Southampton 2020 H 3 0
379 Wolves Man United 2020 A 0 3
[10804 rows x 6 columns]
# Starting point for my code for home team avg points from last 5 home games
results['HomeLast5'] = results['HPts'].rolling(5).mean()
Anyone know how I can add a new column with the rolling average points for a given team and season? I could probably figure out a way of doing this with a loop, but I'm sure that's not going to be the most efficient way to solve this problem.

Group the dataframe by HomeTeam and Season, then calculate the rolling mean on HPts. Then, to assign the calculated mean back to the original dataframe, drop levels 0 and 1 from the resulting MultiIndex so that index alignment works properly.
g = results.groupby(['HomeTeam', 'Season'])['HPts']
results['HomeLast5'] = g.rolling(5).mean().droplevel([0, 1])
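Note that rolling(5).mean() includes the current row in its window, so the column above averages the current game plus the previous four. If "previous 5 home games" should exclude the current match (a reading of the question's wording, not something the answer addresses), a shift(1) inside each group does that:
g = results.groupby(['HomeTeam', 'Season'])['HPts']
# Shift by one within each group so the window covers only the 5 prior games
results['HomeLast5'] = g.transform(lambda s: s.shift(1).rolling(5).mean())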

Related

How to multiply two dataframes of different shapes

I have two dataframes:
The first dataframe df1 looks like this:
variable value
0 plastic 5774
2 glass 42
4 ferrous metal 642
6 non-ferrous metal 14000
8 paper 4000
Here is the head of the second dataframe df2:
waste_type total_waste_recycled_tonne year energy_saved
non-ferrous metal 160400.0 2015 NaN
glass 14600.0 2015 NaN
ferrous metal 15200 2015 NaN
plastic 766800 2015 NaN
I want to update energy_saved in df2 by multiplying total_waste_recycled_tonne in df2 by the matching value from df1, writing the result into the energy_saved column of df2.
For example:
For plastic: 5774 will be multiplied by the total_waste_recycled_tonne of every row in df2 whose waste_type is plastic,
i.e.:
energy_saved = 5774 * 766800
Here is what I tried:
df2["energy_saved"] = df1[df1["variable"]=="plastic"]["value"].values[0] * df2["total_waste_recycled_tonne"][df2["waste_type"]=="plastic"]
However, the problem is that when I do this for one waste type, the rows for the other types change back to NaN. Is there a better approach to handle this?
Use map to look up each waste_type in df1, then multiply:
df2['energy_saved'] = (df2['waste_type'].map(df1.set_index('variable')['value'])
                           .mul(df2['total_waste_recycled_tonne']))
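As a quick, self-contained check of the map approach, both frames can be rebuilt from the question (the second plastic row is assumed from the merge output shown below):
import pandas as pd

df1 = pd.DataFrame({'variable': ['plastic', 'glass', 'ferrous metal',
                                 'non-ferrous metal', 'paper'],
                    'value': [5774, 42, 642, 14000, 4000]})
df2 = pd.DataFrame({'waste_type': ['non-ferrous metal', 'glass',
                                   'ferrous metal', 'plastic', 'plastic'],
                    'total_waste_recycled_tonne': [160400.0, 14600.0,
                                                   15200.0, 766800.0, 762700.0]})
df2['energy_saved'] = (df2['waste_type'].map(df1.set_index('variable')['value'])
                           .mul(df2['total_waste_recycled_tonne']))
# The plastic rows come out as 5774 * 766800.0 = 4.427503e+09
# and 5774 * 762700.0 = 4.403830e+09, matching the outputs below.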
Try merge() with how='right':
df = df1[['variable', 'value']].merge(
    df2[['waste_type', 'total_waste_recycled_tonne']],
    left_on='variable', right_on='waste_type', how='right')
Finally:
df2["energy_saved"] = df['value'].mul(df['total_waste_recycled_tonne'])
Output of df2:
waste_type total_waste_recycled_tonne year energy_saved
0 non-ferrous metal 160400.0 2015 2.245600e+09
1 glass 14600.0 2015 6.132000e+05
2 ferrous metal 15200.0 2015 9.758400e+06
3 plastic 766800.0 2015 4.427503e+09
4 plastic 762700.0 2015 4.403830e+09
A set_index + reindex option (the trailing .values strips the mismatched indexes so plain assignment back to df2 works):
df2['energy_saved'] = (
    df1.set_index('variable').reindex(df2['waste_type'])['value'] *
    df2.set_index('waste_type')['total_waste_recycled_tonne']
).values
df2:
waste_type total_waste_recycled_tonne year energy_saved
0 non-ferrous metal 160400.0 2015 2.245600e+09
1 glass 14600.0 2015 6.132000e+05
2 ferrous metal 15200.0 2015 9.758400e+06
3 plastic 766800.0 2015 4.427503e+09
4 plastic 762700.0 2015 4.403830e+09

How do you sum up rows in Pandas based on conditions for multiple columns and remove the duplicates?

First let me apologise for the long winded question. I've struggled to find an answer on Stackoverflow that addresses my specific issue.
I am new to Pandas and Python programming so I would appreciate all the help I can get.
I have a dataframe:
ID Name Colour Power Year Money (millions)
0 1234567 Tony Stark Red Genius 2020 20000
1 9876543 Peter Parker Red Spider 2021 75
2 1415926 Miles Morales Green Spider 2021 55
3 7777777 Dante Brisco Blue hybrid 2020 3
4 4355681 Thor Odinson Blue Lightning 2020 655
5 1928374 Bruce Wayne Yellow Bat 2021 12000
6 5555555 Eddie Brock Black Symbiote 2021 755
7 8183822 Billie Butcher Yellow V 2021 34
8 6666654 Ian Wilm Red Lightning 2020 34
9 4241111 Harry Potter Green Wizard 2020 24
10 7765434 Blu Malk Red Wizard 2021 77
11 6464647 Yu Hant Black Wizard 2021 65
I want to create a new df that looks like this:
Colour Total Year 2020 Year 2021
Red 20186 20034 152
Green 79 24 55
Blue 658 658 -------
Yellow 12034 ------- 12034
Black 820 ------- 820
Where the "Colour" column becomes the new primary key/ID, the duplicates are removed and the values per year are summed up along with an overall total. I have managed to sum up the Total but I am struggling to write a function that will sum up rows by year and than assign the sum to the respective colour. I would eventually like to Create new columns based on calculations from the Yearly columns (percentages)
Here is what I have after creating the DF from an excel file :
# This line calculates the total from the old df.
df['Total'] = df.groupby(['Colour'])['Money (millions)'].transform('sum')
# This line drops the duplicates, so I now have a Total column that matches the Colours.
new_df = df.drop_duplicates(subset=['Colour'])
When I repeat the process for the Yearly column using the same technique it sums up the overall total for the whole year and assigns it to every colour.
I would eventually like to Create new columns based on calculations from the Yearly columns (percentages) e.g.
new_df['Success Rate'] = new_df['Total'].apply(lambda x: (x/100)*33)
I'd be grateful for any help provided :)
You can use:
df = pd.pivot_table(df, index='Colour', values='Money (millions)', columns='Year', aggfunc='sum', margins=True)
df
Out[1]:
Year 2020 2021 All
Colour
Black NaN 820.0 820
Blue 658.0 NaN 658
Green 24.0 55.0 79
Red 20034.0 152.0 20186
Yellow NaN 12034.0 12034
All 20716.0 13061.0 33777
I think this is pivot_table with margins:
df.pivot_table(index='Colour', columns='Year',
               values='Money (millions)',
               aggfunc='sum',
               margins_name='Total',
               margins=True)
Output:
Year 2020 2021 Total
Colour
Black NaN 820.0 820
Blue 658.0 NaN 658
Green 24.0 55.0 79
Red 20034.0 152.0 20186
Yellow NaN 12034.0 12034
Total 20716.0 13061.0 33777
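To get the exact column labels from the question plus the percentage column mentioned at the end, the pivot result can be post-processed. A sketch (the Success Rate formula is copied from the question; the renamed labels are an assumption about the desired layout):
out = df.pivot_table(index='Colour', columns='Year',
                     values='Money (millions)',
                     aggfunc='sum',
                     margins_name='Total',
                     margins=True)
out = out.rename(columns={2020: 'Year 2020', 2021: 'Year 2021'}).reset_index()
out['Success Rate'] = out['Total'].apply(lambda x: (x / 100) * 33)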

Getting the sum of values with a condition

I have a pandas DataFrame which looks like this:
home_team away_team home_score away_score
Spain Albania 0 5
Albania Spain 4 1
Albania Portugal 1 2
Albania US 0 2
From the first two lines we see that Spain and Albania played 2 times in total, Spain scored 1 goal, Albania scored 9 goals.
Then Albania has one game each against the US and Portugal, with their scores. I am trying to answer: how many goals has Albania scored against each country, and how many goals has that country scored against Albania?
So that I would get a DataFrame like this:
Albania Spain 9 1
Albania Portugal 1 2
Albania US 0 2
When I use print(df.groupby(['away_team']).sum() + df.groupby(['home_team']).sum()) I do not get what I want: some lines are filled with NaNs, and the sums do not add up correctly.
You can sort both team columns and assign them back, then swap the score values on the rows where the original home_team no longer matches the sorted one, and finally aggregate with sum:
orig = df['home_team'].copy()
df[['home_team','away_team']] = np.sort(df[['home_team','away_team']], axis=1)
m = orig.ne(df['home_team'])
df.loc[m, ['home_score','away_score']] = df.loc[m, ['away_score','home_score']].values
print (df)
home_team away_team home_score away_score
0 Albania Spain 5 0
1 Albania Spain 4 1
2 Albania Portugal 1 2
3 Albania US 0 2
df1 = df.groupby(['home_team', 'away_team'], as_index=False).sum()
print (df1)
home_team away_team home_score away_score
0 Albania Portugal 1 2
1 Albania Spain 9 1
2 Albania US 0 2
sort the home & away team columns alphabetically to generate two new team columns
add two more columns for score_for & score_against
group by the two new team columns and sum the two new score columns
df[['team1', 'team2']] = df[['home_team', 'away_team']].apply(np.sort, axis=1, result_type='expand')
df[['score_for', 'score_against']] = df.apply(
    lambda x: [x.home_score, x.away_score] if x.team1 == x.home_team else [x.away_score, x.home_score],
    axis=1,
    result_type='expand')
df.groupby(['team1', 'team2'])[['score_for', 'score_against']].sum()
score_for score_against
team1 team2
Albania Portugal 1 2
Spain 9 1
US 0 2
Swap the home and away teams, as well as the scores, then concatenate the two dataframes and group:
# Transpose so the four columns become rows 0-3
df1 = df.T.reset_index(drop=True)
# Swap the team rows (0 <-> 1) and the score rows (2 <-> 3)
df2 = df1.rename({0: 1, 1: 0, 2: 3, 3: 2}).sort_index()
# Stack the original and swapped frames, then aggregate per pairing
pd.concat([df1.T, df2.T]).groupby([0, 1]).sum().loc[['Albania']]
# Matches where Albania is at home: the home columns are Albania's
home = df[df.home_team == "Albania"]
home.columns = ["country", "opponent", "win", "lost"]
# Matches where Albania is away: swap the column meanings
away = df[df.away_team == "Albania"]
away.columns = ["opponent", "country", "lost", "win"]
# concat aligns on the renamed columns, so the scores line up before summing
pd.concat([home, away], ignore_index=True).groupby(["country", "opponent"]).sum()
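For trying out any of these answers, the question's frame can be rebuilt in a few lines (values copied from the sample above):
import pandas as pd

df = pd.DataFrame({'home_team': ['Spain', 'Albania', 'Albania', 'Albania'],
                   'away_team': ['Albania', 'Spain', 'Portugal', 'US'],
                   'home_score': [0, 4, 1, 0],
                   'away_score': [5, 1, 2, 2]})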

Count occurrences depending on a condition & save in a new column

I am relatively new to pandas / python.
I have a list of names and dates. I want to group the entries by Name and count the number of Names for 'after 2016' and 'before 2016'. The count should be added to a new column.
My input:
Name Date
Marc 2006
Carl 2003
Carl 2002
Carl 1990
Marc 1999
Max 2016
Max 2014
Marc 2006
Carl 2003
Carl 2002
Carl 2019
Marc 1999
Max 2016
Max 2014
And the output should look like this:
     Before 2016  Count
Marc           1      4
Marc           0      0
Carl           1      5
Carl           0      1
Max            1      2
Max            0      2
So the output should have 2 entries for each Name, one with the count of entries before 2016 and one after. Additionally, a column which just states 1 for before 2016 and 0 for after.
As mentioned before, I am quite a beginner. I was able to count the entries with the condition of the year:
df.groupby('Name')['Date'].apply(lambda x: (x<'2016').sum()).reset_index(name='count')
But honestly, I am not quite sure what to do next. Maybe somebody could point me in the right direction.
You can pass apply a function which returns a 2x2 dataframe. Something like this:
def counting(x):
    bef = (x < 2016).sum()
    aft = (x > 2016).sum()
    return pd.DataFrame([[1, bef], [0, aft]], index=[x.name, x.name],
                        columns=["before 2016", "Count"])

ddf = df.groupby('Name')['Date'].apply(counting).reset_index(level=0, drop=True)
ddf is:
before 2016 Count
Carl 1 5
Carl 0 1
Marc 1 4
Marc 0 0
Max 1 2
Max 0 0
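Note that x > 2016 excludes 2016 itself, which is why Max shows 0 here while the expected output in the question shows 2. If years equal to 2016 should count as "after" (an assumption about the intent), use a >= comparison instead:
aft = (x >= 2016).sum()  # counts 2016 itself, giving Max a Count of 2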
You can group by an external series having the same length as the dataframe:
s = df['Date'].lt(2016).astype('int')
s.name = 'Before 2016'
df.groupby(['Name', s]).count()
Result:
Date
Name Before 2016
Carl 0 1
1 5
Marc 1 4
Max 0 2
1 2
lt stands for "less than". Other comparison functions are le (less than or equal), gt (greater than), ge (greater than or equal) and eq (equal).
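A quick illustration of these helpers on a plain Series:
s = pd.Series([1990, 2016, 2019])
s.lt(2016)  # [True, False, False]
s.ge(2016)  # [False, True, True]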
From what I understand, you need to populate both 1 and 0 for each name; try pivot_table with unstack():
(df.assign(Before=df['Date'].lt(2016).view('i1'))
   .pivot_table('Date', 'Name', 'Before', aggfunc='count', fill_value=0)
   .unstack()
   .sort_index(level=1)
   .reset_index(0, name='Count'))
Before Count
Name
Carl 0 1
Carl 1 5
Marc 0 0
Marc 1 4
Max 0 2
Max 1 2

Boxplot: Extract outliers and tag them as either '0' or '1'

I'm trying to extract outliers from my dataset and tag them accordingly.
Sample Data
Doctor Name Hospital Assigned Region Claims Illness Claimed
1 Albert Some hospital Center R-1 20 Sepsis
2 Simon Another hospital Center R-2 21 Pneumonia
3 Alvin ... ... ... ...
4 Robert
5 Benedict
6 Cruz
So I'm trying to group every Doctor that claimed a certain Illness in a certain Region and find the outliers among them.
Doctor Name Hospital Assigned Region Claims Illness Claimed is_outlier
1 Albert Some hospital Center R-1 20 Sepsis 1
2 Simon Another hospital Center R-2 21 Pneumonia 0
3 Alvin ... ... ... ...
4 Robert
5 Benedict
6 Cruz
I can do this in Power BI. But being fairly new to Python, I can't seem to figure this out.
This is what I'm trying to achieve:
Algo goes like:
Read data
Group data by Illness
Group by Region
get IQR based on Claims Count
if claims count > Q3 + 1.5 * IQR
    then tag it as outlier = 1
else
    not an outlier = 0
Export data
Any ideas?
Assuming you use pandas for data analysis (and you should!), you can use the dataframe boxplot method to produce a plot similar to yours:
import pandas as pd
import numpy as np
df.boxplot(column=['b'], whis=[10, 90], vert=False,
           flierprops=dict(markerfacecolor='g', marker='D'))
or, if you want to mark them 0,1 as you requested, use dataframe quantile() method https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html
df.assign(outlier=df[df>=df.quantile(.9)].any(axis=1)).astype(np.int8)
a b outlier
0 1 1 0
1 2 10 0
2 3 100 1
3 4 100 1
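Neither snippet above groups by Illness and Region, though. Here is a sketch of the per-group IQR rule described in the question (column names are assumed from the sample data, and the conventional Q3 + 1.5 * IQR fence is used):
def iqr_flag(s):
    # Flag values above the upper Tukey fence of their own group
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return (s > q3 + 1.5 * (q3 - q1)).astype(int)

df['is_outlier'] = (df.groupby(['Illness Claimed', 'Region'])['Claims']
                      .transform(iqr_flag))
df.to_csv('tagged_claims.csv', index=False)  # export step; the filename is arbitrary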
