Dataframe Interpolation based on table values - python

I have a dataset with different latitude values; the latitudes range from 0 to 20 and the months from 1 to 12.
How can I compute a new column N in my dataset for each latitude and month?
Since the latitudes in my dataset are not integer values, interpolation is necessary.
INPUT DATASET
LAT YEAR MONTH
0 11 2000 1
1 9 2000 2
2 11 2000 3
3 10 2000 4
4 17 2000 5
5 6 2000 6
6 18 2000 7
7 11 2000 8
8 17 2000 9
9 12 2000 10
10 19 2000 11
11 8 2000 12
12 14 2001 1
13 13 2001 2
14 14 2001 3
15 12 2001 4
16 12 2001 5
17 5 2001 6
18 18 2001 7
19 13 2001 8
20 7 2001 9
21 18 2001 10
22 12 2001 11
23 10 2001 12
24 14 2002 1
25 14 2002 2
26 20 2002 3
27 20 2002 4
28 9 2002 5
29 15 2002 6
30 15 2002 7
31 5 2002 8
32 7 2002 9
33 5 2002 10
34 6 2002 11
35 7 2002 12
N values by month, according to latitude:
1 2 3 4 5 6 7 8 9 10 11 12
lat
0 1.04 0.94 1.04 1.01 1.04 1.01 1.04 1.04 1.01 1.04 1.01 1.04
10 1.00 0.91 1.03 1.03 1.08 1.06 1.08 1.07 1.02 1.02 0.98 0.99
15 0.97 0.91 1.03 1.04 1.11 1.08 1.12 1.08 1.02 1.01 0.95 0.97
20 0.95 0.90 1.03 1.65 1.13 1.11 1.14 1.12 1.02 1.00 0.93 0.94
The code for the N values table is:
import pandas as pd

data2 = {"lat":[0,10,15,20],"1":[1.04,1,0.97,0.95],"2":[0.94,0.91,0.91,0.9],"3":[1.04,1.03,1.03,1.03],
         "4":[1.01,1.03,1.04,1.65],"5":[1.04,1.08,1.11,1.13],"6":[1.01,1.06,1.08,1.11],"7":[1.04,1.08,1.12,1.14],
         "8":[1.04,1.07,1.08,1.12],"9":[1.01,1.02,1.02,1.02],"10":[1.04,1.02,1.01,1],"11":[1.01,0.98,0.95,0.93],
         "12":[1.04,0.99,0.97,0.94]}
df2 = pd.DataFrame(data2)
For example, if the latitude is 20 and the month is 3, the result in the N column must be 1.03; if the latitude is 11 and the month is 1, the result in the N column must be roughly 0.97.

Using the DataFrame.interpolate() method comes in extremely handy in your case:
import pandas as pd

data2 = {"lat":[0,10,15,20],"1":[1.04,1,0.97,0.95],"2":[0.94,0.91,0.91,0.9],"3":[1.04,1.03,1.03,1.03],
         "4":[1.01,1.03,1.04,1.65],"5":[1.04,1.08,1.11,1.13],"6":[1.01,1.06,1.08,1.11],"7":[1.04,1.08,1.12,1.14],
         "8":[1.04,1.07,1.08,1.12],"9":[1.01,1.02,1.02,1.02],"10":[1.04,1.02,1.01,1],"11":[1.01,0.98,0.95,0.93],
         "12":[1.04,0.99,0.97,0.94]}
df2 = pd.DataFrame(data2).set_index('lat')

# Insert an empty row for every integer latitude missing from the table,
# then sort by latitude so interpolate() can fill the gaps linearly.
index_set = df2.index.unique()
for i in range(20):
    if i not in index_set:
        df2.loc[i] = pd.Series(dtype='float64')
df2 = df2.sort_index()
res_df = df2.interpolate()

# Read the input dataset and look up the interpolated N for each row
# (lat is truncated to an integer before the lookup).
tdf = pd.read_csv('try.tsv', sep=r'\s+', header=None, index_col=None)
tdf.columns = ['id', 'lat', 'year', 'month']
tdf['lat'] = tdf.lat.astype(int)
tdf['N'] = tdf.apply(lambda x: res_df.loc[x['lat'], str(x['month'])], axis=1)
print(tdf)
Output:
id lat year month N
0 0 11 2000 1 0.994
1 1 9 2000 2 0.913
2 2 11 2000 3 1.030
3 3 10 2000 4 1.030
4 4 17 2000 5 1.118
5 5 6 2000 6 1.040
6 6 18 2000 7 1.132
7 7 11 2000 8 1.072
8 8 17 2000 9 1.020
9 9 12 2000 10 1.016
10 10 19 2000 11 0.934
11 11 8 2000 12 1.000
12 12 14 2001 1 0.976
13 13 13 2001 2 0.910
14 14 14 2001 3 1.030
15 15 12 2001 4 1.034
16 16 12 2001 5 1.092
17 17 5 2001 6 1.035
18 18 18 2001 7 1.132
19 19 13 2001 8 1.076
20 20 7 2001 9 1.017
21 21 18 2001 10 1.004
22 22 12 2001 11 0.968
23 23 10 2001 12 0.990
24 24 14 2002 1 0.976
25 25 14 2002 2 0.910
26 26 20 2002 3 1.030
27 27 20 2002 4 1.650
28 28 9 2002 5 1.076
29 29 15 2002 6 1.080
30 30 15 2002 7 1.120
31 31 5 2002 8 1.055
32 32 7 2002 9 1.017
33 33 5 2002 10 1.030
34 34 6 2002 11 0.992
35 35 7 2002 12 1.005
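Note that the lookup above truncates lat to an integer before indexing. If you wanted the fractional latitudes interpolated exactly, a minimal sketch using numpy.interp against the original four-row table (rebuilt here so the snippet is self-contained) could look like this:
import numpy as np
import pandas as pd

# Rebuild the lookup table from the question.
data2 = {"lat":[0,10,15,20],"1":[1.04,1,0.97,0.95],"2":[0.94,0.91,0.91,0.9],"3":[1.04,1.03,1.03,1.03],
         "4":[1.01,1.03,1.04,1.65],"5":[1.04,1.08,1.11,1.13],"6":[1.01,1.06,1.08,1.11],"7":[1.04,1.08,1.12,1.14],
         "8":[1.04,1.07,1.08,1.12],"9":[1.01,1.02,1.02,1.02],"10":[1.04,1.02,1.01,1],"11":[1.01,0.98,0.95,0.93],
         "12":[1.04,0.99,0.97,0.94]}
lut = pd.DataFrame(data2)

def n_value(lat, month):
    # Linear interpolation along the latitude axis for the given month column;
    # lat may be fractional, e.g. 11.37.
    return np.interp(lat, lut["lat"], lut[str(month)])

print(n_value(11, 1))  # 0.994, matching row 0 of the output above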

Related

Drop certain rows based on quantity of rows with specific values

I am newer to data science and am working on a project to analyze sports statistics. I have a dataset of hockey statistics for a group of players over multiple seasons. Players have anywhere from 1 to 12 rows representing their season statistics over however many seasons they've played.
Example:
Player Season Pos GP G A P +/- PIM P/GP ... PPG PPP SHG SHP OTG GWG S S% TOI/GP FOW%
0 Nathan MacKinnon 2022 1 65 32 56 88 22 42 1.35 ... 7 27 0 0 1 5 299 10.7 21.07 45.4
1 Nathan MacKinnon 2021 1 48 20 45 65 22 37 1.35 ... 8 25 0 0 0 2 206 9.7 20.37 48.5
2 Nathan MacKinnon 2020 1 69 35 58 93 13 12 1.35 ... 12 31 0 0 2 4 318 11.0 21.22 43.1
3 Nathan MacKinnon 2019 1 82 41 58 99 20 34 1.21 ... 12 37 0 0 1 6 365 11.2 22.08 43.7
4 Nathan MacKinnon 2018 1 74 39 58 97 11 55 1.31 ... 12 32 0 1 3 12 284 13.7 19.90 41.9
5 Nathan MacKinnon 2017 1 82 16 37 53 -14 16 0.65 ... 2 14 2 2 2 4 251 6.4 19.95 50.6
6 Nathan MacKinnon 2016 1 72 21 31 52 -4 20 0.72 ... 7 16 0 1 0 6 245 8.6 18.87 48.4
7 Nathan MacKinnon 2015 1 64 14 24 38 -7 34 0.59 ... 3 7 0 0 0 2 192 7.3 17.05 47.0
8 Nathan MacKinnon 2014 1 82 24 39 63 20 26 0.77 ... 8 17 0 0 0 5 241 10.0 17.35 42.9
9 J.T. Compher 2022 2 70 18 15 33 6 25 0.47 ... 4 6 1 1 0 0 102 17.7 16.32 51.4
10 J.T. Compher 2021 2 48 10 8 18 10 19 0.38 ... 1 2 0 0 0 2 47 21.3 14.22 45.9
11 J.T. Compher 2020 2 67 11 20 31 9 18 0.46 ... 1 5 0 3 1 3 106 10.4 16.75 47.7
12 J.T. Compher 2019 2 66 16 16 32 -8 31 0.48 ... 4 9 3 3 0 3 118 13.6 17.48 49.2
13 J.T. Compher 2018 2 69 13 10 23 -29 20 0.33 ... 4 7 2 2 2 3 131 9.9 16.00 45.1
14 J.T. Compher 2017 2 21 3 2 5 0 4 0.24 ... 1 1 0 0 0 1 30 10.0 14.93 47.6
15 Darren Helm 2022 1 68 7 8 15 -5 14 0.22 ... 0 0 1 2 0 1 93 7.5 10.55 44.2
16 Darren Helm 2021 1 47 3 5 8 -3 10 0.17 ... 0 0 0 0 0 0 83 3.6 14.68 66.7
17 Darren Helm 2020 1 68 9 7 16 -6 37 0.24 ... 0 0 1 2 0 0 102 8.8 13.73 53.6
18 Darren Helm 2019 1 61 7 10 17 -11 20 0.28 ... 0 0 1 4 0 0 107 6.5 14.57 44.4
19 Darren Helm 2018 1 75 13 18 31 3 39 0.41 ... 0 0 2 4 0 0 141 9.2 15.57 44.1
[sample of my dataset][1]
[1]: https://i.stack.imgur.com/7CsUd.png
If any player has played more than 6 seasons, I want to drop the row corresponding to Season 2021. This is because COVID drastically shortened the season and it is causing issues as I work with averages.
As you can see from the screenshot, Nathan MacKinnon has played 9 seasons. Across those 9 seasons, except for 2021, he plays in no fewer than 64 games. Due to the shortened season of 2021, he only got 48 games.
Removing Season 2021 results in an Average Games Played of 73.75.
Keeping Season 2021 in the data, the Average Games Played becomes 70.89.
While not drastic, it compounds into the other metrics as well.
I have been trying this for a little while now, but as I mentioned, I am new to this world and am struggling to figure out how to accomplish this.
I don't want to just completely drop ALL rows for 2021 across all players, though, as some players only have 1-5 years' worth of data, and for those players I need to use as much data as I can; removing a row from a player with only 2 seasons would also negatively skew their averages.
I would really appreciate some assistance from anyone more experienced than me!
This can be accomplished by using groupby and apply. For example:
edited_players = (players
    .groupby("Player")
    .apply(lambda subset: subset if len(subset) <= 6 else subset.query("Season != 2021"))
)
The round brackets are just there so the chain can be wrapped across lines.
The combination of groupby and apply basically feeds a grouped subset of your dataframe to a function. So, first all the rows of Nathan MacKinnon will be used, then rows for J.T. Compher, then Darren Helm rows, etc.
The function used is an anonymous/lambda function which operates under the following logic: "if the dataframe subset that I receive has 6 or fewer rows, I'll return the subset unedited. Otherwise, I will filter out rows within that subset which have the value 2021 in the Season column".
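For illustration, here is a self-contained toy run of the same logic (the player names and numbers below are invented, not taken from the real dataset):
import pandas as pd

# Hypothetical mini-dataset: player A has 7 seasons, player B has only 2.
players = pd.DataFrame({
    "Player": ["A"] * 7 + ["B"] * 2,
    "Season": [2016, 2017, 2018, 2019, 2020, 2021, 2022, 2021, 2022],
    "GP":     [70,   72,   74,   75,   68,   48,   73,   45,   60],
})

edited_players = (players
    .groupby("Player", group_keys=False)
    .apply(lambda subset: subset if len(subset) <= 6 else subset.query("Season != 2021"))
)
print(edited_players)  # A's 2021 row is dropped; B keeps both of its rows.
Passing group_keys=False keeps the original row index instead of prepending the group label to it.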

In Selenium see if the 'a' anchor tag contains the text I want, and then extract multiple td's of text in the same row

For my Python code, I have been trying to scrape data from NCAAF Stats. I have been having issues extracting the tds' text after I check whether the anchor tag 'a' contains the text I want. I want to be able to find a team's number of TDs, points, and PPG. I have been able to successfully find the school by text in Selenium, but after that I am unable to extract the info I want. Here is what I have coded so far.
from selenium import webdriver
driver = webdriver.Chrome('C:\\Users\\Carl\\Downloads\\chromedriver.exe')
driver.get('https://www.ncaa.com/stats/football/fbs/current/team/27')
# I plan to make a while or for loop later, that is why I used f strings
team = "Coastal Carolina"
first = driver.find_element_by_xpath(f'//a[text()="{team}"]')
# This was the way another similarly asked question was answered, but it did not work
#tds = driver.find_element_by_xpath(f'//td//a[text()="{apples}"]/../td[4]').text
# This grabs data from the very first row of data... not the one I want
tds = first.find_element_by_xpath('//following-sibling::td[4]').text
total_points = first.find_element_by_xpath('//following-sibling::td[10]').text
ppg = first.find_element_by_xpath('//following-sibling::td[11]').text
print(tds, total_points, ppg)
driver.quit()
I have tried to look around for a similarly asked question and was able to find this snippet
tds = driver.find_element_by_xpath(f'//td//a[text()="{apples}"]/../td[4]').text
it unfortunately did not help me out much. The html structure looks like this. I appreciate any help, and thank you!
No need to use Selenium, the page isn't dynamic. Just use pandas to parse the table for you:
import pandas as pd

url = 'https://www.ncaa.com/stats/football/fbs/current/team/27'
df = pd.read_html(url)[0]
Output:
print(df)
Rank Team G TDs PAT 2PT Def Pts FG Saf Pts PPG
0 1 Ohio St. 6 39 39 0 0 6 0 291.0 48.5
1 2 Pittsburgh 6 40 36 0 0 4 1 290.0 48.3
2 3 Coastal Carolina 7 43 42 0 0 6 1 320.0 45.7
3 4 Alabama 7 41 40 1 0 9 0 315.0 45.0
4 5 Ole Miss 6 35 30 1 0 6 1 262.0 43.7
5 6 Cincinnati 6 36 34 1 0 3 0 261.0 43.5
6 7 Oklahoma 7 35 34 1 1 17 0 299.0 42.7
7 - SMU 7 40 36 1 0 7 0 299.0 42.7
8 9 Texas 7 38 37 0 0 8 1 291.0 41.6
9 10 Western Ky. 6 31 27 1 0 10 0 245.0 40.8
10 11 Tennessee 7 36 36 0 0 7 1 275.0 39.3
11 12 Wake Forest 6 28 24 2 0 12 0 232.0 38.7
12 13 UTSA 7 33 33 0 0 13 0 270.0 38.6
13 14 Michigan 6 28 25 1 0 12 0 231.0 38.5
14 15 Georgia 7 34 33 0 0 10 1 269.0 38.4
15 16 Baylor 7 35 35 0 0 7 1 268.0 38.3
16 17 Houston 6 30 28 0 0 5 0 223.0 37.2
17 - TCU 6 29 28 0 0 7 0 223.0 37.2
18 19 Marshall 7 34 33 0 0 7 0 258.0 36.9
19 - North Carolina 7 34 32 2 0 6 0 258.0 36.9
20 21 Nevada 6 26 24 1 0 12 0 218.0 36.3
21 22 Virginia 7 31 29 2 0 10 2 253.0 36.1
22 23 Fresno St. 7 32 27 1 0 10 0 251.0 35.9
23 - Memphis 7 33 26 3 0 7 0 251.0 35.9
24 25 Texas Tech 7 32 31 0 0 9 0 250.0 35.7
25 26 Auburn 7 29 28 1 0 12 1 242.0 34.6
26 27 Florida 7 33 29 1 0 4 0 241.0 34.4
27 - Missouri 7 31 31 0 0 8 0 241.0 34.4
28 29 Liberty 7 33 29 1 0 3 1 240.0 34.3
29 - Michigan St. 7 30 30 0 0 10 0 240.0 34.3
30 31 UCF 6 28 26 0 0 3 1 205.0 34.2
31 32 Oregon St. 6 27 27 0 0 5 0 204.0 34.0
32 33 Oregon 6 26 26 0 0 7 0 203.0 33.8
33 34 Iowa St. 6 23 22 0 0 14 0 202.0 33.7
34 35 UCLA 7 30 28 0 0 9 0 235.0 33.6
35 36 San Diego St. 6 25 24 1 0 7 0 197.0 32.8
36 37 LSU 7 29 29 0 0 8 0 227.0 32.4
37 38 Louisville 6 24 23 0 0 9 0 194.0 32.3
38 - Miami (FL) 6 24 22 1 0 8 1 194.0 32.3
39 - NC State 6 25 24 0 0 6 1 194.0 32.3
40 41 Southern California 6 22 19 3 0 12 0 193.0 32.2
41 42 Tulane 7 31 23 4 0 2 0 223.0 31.9
42 43 Arizona St. 7 30 25 2 0 4 0 221.0 31.6
43 44 Utah 6 25 22 1 0 5 0 189.0 31.5
44 45 Air Force 7 29 27 1 0 5 1 220.0 31.4
45 46 App State 7 27 24 0 0 11 0 219.0 31.3
46 47 Arkansas 7 27 25 0 0 10 0 217.0 31.0
47 - Army West Point 6 25 22 0 0 4 1 186.0 31.0
48 - Notre Dame 6 23 20 2 0 8 0 186.0 31.0
49 - Western Mich. 7 28 25 0 0 8 0 217.0 31.0
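Once the table is parsed, pulling a single team's TDs, points, and PPG is a plain boolean filter (column names as in the output above):
team = 'Coastal Carolina'
print(df.loc[df['Team'] == team, ['TDs', 'Pts', 'PPG']])
As an aside, the original XPaths failed because an expression starting with // always searches from the document root, even when called on an element. If you did want to stay in Selenium, making the path relative with a leading dot (and walking up to the row, assuming the anchor sits inside a td of a tr) would be the usual fix:
# Hypothetical fix: the leading '.' anchors the search at `first`, not the document root.
tds = first.find_element_by_xpath('./ancestor::tr/td[4]').text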

Pull specific values from one dataframe based on values in another

I have two dataframes
df1:
Country  value  Average  Week Rank
UK       42     42       1
US       9      9.5      2
DE       10     9.5      3
NL       15     15.5     4
ESP      16     15.5     5
POL      17     18       6
CY       18     18       7
IT       20     18       8
AU       17     18       9
FI       18     18       10
SW       20     18       11
df2:
Country  value  Average  Year Rank
US       42     42       1
UK       9      9.5      2
ESP      10     9.5      3
SW       15     15.5     4
IT       16     15.5     5
POL      17     18       6
NO       18     18       7
SL       20     18       8
PO       17     18       9
FI       18     18       10
NL       20     18       11
DE       17     18       12
AU       18     18       13
CY       20     18       14
I'm looking to create a column in df1 that shows the 'Year Rank' of the countries in df1, so that I have the following:
Country  value  Average  Week Rank  Year Rank
UK       42     42       1          2
US       9      9.5      2          1
DE       10     9.5      3          9
NL       15     15.5     4          8
ESP      16     15.5     5          3
POL      17     18       6          6
CY       18     18       7          7
IT       20     18       8          5
AU       17     18       9          13
FI       18     18       10         10
SW       20     18       11         4
How would I loop through the countries in df1 and find the corresponding rank in df2?
Edit: I am only looking for the yearly rank of the countries in df1
Thanks!
Use:
df1['Year Rank'] = df1.merge(df2, on='Country')['Year Rank']
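Note this relies on the merged rows lining up positionally with df1. A map-based variant, which looks each country up by key instead, avoids that assumption:
# Look up each country's Year Rank in df2 by key rather than by row position.
df1['Year Rank'] = df1['Country'].map(df2.set_index('Country')['Year Rank'])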

Sum values of a column for each value based on another column and divide it by total

Today I'm struggling once again with Python and data-analytics.
I got a dataframe which looks like this:
name totdmgdealt
0 Warwick 96980.0
1 Nami 25995.0
2 Draven 171568.0
3 Fiora 113721.0
4 Viktor 185302.0
5 Skarner 148791.0
6 Galio 130692.0
7 Ahri 145731.0
8 Jinx 182680.0
9 VelKoz 85785.0
10 Ziggs 46790.0
11 Cassiopeia 62444.0
12 Yasuo 117896.0
13 Warwick 129156.0
14 Evelynn 179252.0
15 Caitlyn 163342.0
16 Wukong 122919.0
17 Syndra 146754.0
18 Karma 35766.0
19 Warwick 117790.0
20 Draven 74879.0
21 Janna 11242.0
22 Lux 66424.0
23 Amumu 87826.0
24 Vayne 76085.0
25 Ahri 93334.0
..
..
..
This is a dataframe that contains the total damage dealt by a champion in a single game.
Now I want to group this information so I can see which champion has dealt the most damage overall.
I tried groupby('name') but it didn't work at all.
I already went through some threads about groupby and summing values, but I didn't solve my specific problem.
The dealt damage of each champion should also be shown as percentage of the total.
I'm looking for something like this as an output:
name totdmgdealt percentage
0 Warwick 2378798098 2.1 %
1 Nami 2837491074 2.3 %
2 Draven 1231451224 ..
3 Fiora 1287301724 ..
4 Viktor 1239808504 ..
5 Skarner 1487911234 ..
6 Galio 1306921234 ..
We can groupby on name and get the sum, then divide each value by the total with .div, multiply it by 100 with .mul, and finally round it to one decimal with .round:
total = df['totdmgdealt'].sum()
summed = df.groupby('name', sort=False)['totdmgdealt'].sum().reset_index()
summed['percentage'] = summed['totdmgdealt'].div(total).mul(100).round(1)
Output:
name totdmgdealt percentage
0 Warwick 343926.0 12.2
1 Nami 25995.0 0.9
2 Draven 246447.0 8.7
3 Fiora 113721.0 4.0
4 Viktor 185302.0 6.6
5 Skarner 148791.0 5.3
6 Galio 130692.0 4.6
7 Ahri 239065.0 8.5
8 Jinx 182680.0 6.5
9 VelKoz 85785.0 3.0
10 Ziggs 46790.0 1.7
11 Cassiopeia 62444.0 2.2
12 Yasuo 117896.0 4.2
13 Evelynn 179252.0 6.4
14 Caitlyn 163342.0 5.8
15 Wukong 122919.0 4.4
16 Syndra 146754.0 5.2
17 Karma 35766.0 1.3
18 Janna 11242.0 0.4
19 Lux 66424.0 2.4
20 Amumu 87826.0 3.1
21 Vayne 76085.0 2.7
You can use sum() to get the total damage, and apply to calculate the percentage for each row, like this:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO("""
name totdmgdealt
0 Warwick 96980.0
1 Nami 25995.0
2 Draven 171568.0
3 Fiora 113721.0
4 Viktor 185302.0
5 Skarner 148791.0
6 Galio 130692.0
7 Ahri 145731.0
8 Jinx 182680.0
9 VelKoz 85785.0
10 Ziggs 46790.0
11 Cassiopeia 62444.0
12 Yasuo 117896.0
13 Warwick 129156.0
14 Evelynn 179252.0
15 Caitlyn 163342.0
16 Wukong 122919.0
17 Syndra 146754.0
18 Karma 35766.0
19 Warwick 117790.0
20 Draven 74879.0
21 Janna 11242.0
22 Lux 66424.0
23 Amumu 87826.0
24 Vayne 76085.0
25 Ahri 93334.0"""), sep=r"\s+")
summed_df = df.groupby('name')['totdmgdealt'].agg(['sum']).rename(columns={"sum": "totdmgdealt"}).reset_index()
summed_df['percentage'] = summed_df.apply(
    lambda x: "{:.2f}%".format(x['totdmgdealt'] / summed_df['totdmgdealt'].sum() * 100), axis=1)
print(summed_df)
Output:
name totdmgdealt percentage
0 Ahri 239065.0 8.48%
1 Amumu 87826.0 3.12%
2 Caitlyn 163342.0 5.79%
3 Cassiopeia 62444.0 2.21%
4 Draven 246447.0 8.74%
5 Evelynn 179252.0 6.36%
6 Fiora 113721.0 4.03%
7 Galio 130692.0 4.64%
8 Janna 11242.0 0.40%
9 Jinx 182680.0 6.48%
10 Karma 35766.0 1.27%
11 Lux 66424.0 2.36%
12 Nami 25995.0 0.92%
13 Skarner 148791.0 5.28%
14 Syndra 146754.0 5.21%
15 Vayne 76085.0 2.70%
16 VelKoz 85785.0 3.04%
17 Viktor 185302.0 6.57%
18 Warwick 343926.0 12.20%
19 Wukong 122919.0 4.36%
20 Yasuo 117896.0 4.18%
21 Ziggs 46790.0 1.66%
Maybe you can try this. I achieved the same using my own sample data; run the code below in your Jupyter Notebook:
import pandas as pd

name = ['abhit', 'mawa', 'vaibhav', 'dharam', 'sid', 'abhit', 'vaibhav', 'sid', 'mawa', 'lakshya']
totdmgdealt = [24, 45, 80, 22, 89, 55, 89, 51, 93, 85]
name = pd.Series(name, name='name')                       # converting into a Series
totdmgdealt = pd.Series(totdmgdealt, name='totdmgdealt')  # converting into a Series
data = pd.concat([name, totdmgdealt], axis=1)
data = pd.DataFrame(data)  # converting into a DataFrame

# Actual aggregating step: total damage per champion.
final = data.pivot_table(values="totdmgdealt", columns="name", aggfunc="sum").transpose()

def calPer(row, total):  # actual function for the percentage
    return ((row / total) * 100).round(2)

total = final['totdmgdealt'].sum()  # total used to calculate the percentage
final['Percentage'] = calPer(final['totdmgdealt'], total)  # assigning the function result to the column
final
Sample Data :
name totdmgdealt
0 abhit 24
1 mawa 45
2 vaibhav 80
3 dharam 22
4 sid 89
5 abhit 55
6 vaibhav 89
7 sid 51
8 mawa 93
9 lakshya 85
Output:
totdmgdealt Percentage
name
abhit 79 12.48
dharam 22 3.48
lakshya 85 13.43
mawa 138 21.80
sid 140 22.12
vaibhav 169 26.70
Run the code and just replace the sample dataset with yours. Maybe this helps.

Average of last 13 months for each record in pandas

I am trying to calculate the average of the last 13 months for each month for P1 and P2. Here is a sample of the data:
P1 P2
Month
May-16 4 24
Jun-16 2 9
Jul-16 4 20
Aug-16 2 12
Sep-16 7 8
Oct-16 7 11
Nov-16 0 4
Dec-16 3 18
Jan-17 4 9
Feb-17 9 16
Mar-17 2 13
Apr-17 9 9
May-17 5 13
Jun-17 9 16
Jul-17 5 11
Aug-17 6 11
Sep-17 8 13
Oct-17 6 12
Nov-17 9 21
Dec-17 4 12
Jan-18 2 12
Feb-18 7 17
Mar-18 5 15
Apr-18 3 13
May-18 7 25
Jun-18 5 23
I am trying to create this table:
P1 P2 AVGP1 AVGP2
Month
Jun-17 9 16 4.85 11.23
Jul-17 5 11 5.08 11.38
Aug-17 6 11 5.23 11.54
Sep-17 8 13 5.69 11.54
Oct-17 6 12 5.62 11.85
Nov-17 9 21 5.77 12.46
Dec-17 4 12 6.08 13.08
Jan-18 2 12 6.00 12.62
Feb-18 7 17 6.23 13.23
Mar-18 5 15 5.92 13.23
Apr-18 3 13 6.00 13.23
May-18 7 25 5.85 14.46
Jun-18 5 23 5.85 15.23
The goal is to create a dataframe with the above table. I can't figure out how to make a function that will calculate only the last 13 months of data. Any help would be great!
You can use pd.DataFrame.rolling followed by dropna:
res = df.join(df.rolling(13).mean().add_prefix('AVG')).dropna(how='any')
print(res)
P1 P2 AVGP1 AVGP2
Month
May-17 5 13 4.461538 12.769231
Jun-17 9 16 4.846154 12.153846
Jul-17 5 11 5.076923 12.307692
Aug-17 6 11 5.230769 11.615385
Sep-17 8 13 5.692308 11.692308
Oct-17 6 12 5.615385 12.000000
Nov-17 9 21 5.769231 12.769231
Dec-17 4 12 6.076923 13.384615
Jan-18 2 12 6.000000 12.923077
Feb-18 7 17 6.230769 13.538462
Mar-18 5 15 5.923077 13.461538
Apr-18 3 13 6.000000 13.461538
May-18 7 25 5.846154 14.692308
Jun-18 5 23 5.846154 15.461538
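Note that rolling(13) only produces a value once 13 observations are in the window, so the first 12 rows come out as NaN and dropna removes them; that is why the result starts at May-17. If partial windows were acceptable, the min_periods parameter would relax this (a sketch reusing df from above):
# Growing-window variant: averages over however many rows are available, up to 13.
partial = df.join(df.rolling(13, min_periods=1).mean().add_prefix('AVG'))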
