Is there a way to record the sample size when calling the mean() method of a groupby object?
Consider the following dataframe:
In [16]: df
Out[16]:
formation phi sw
0 nio 14 47
1 nio 10 16
2 nio 12 12
3 nio 19 82
4 nio 23 43
5 fthays 24 19
6 codell 23 5
7 codell 24 45
8 codell 9 11
9 graneros 26 11
10 graneros 15 45
11 graneros 12 16
12 dkot 11 79
It's easy enough to compute the mean across each of these formations using the mean() method of the groupby object:
In [17]: df.groupby(['formation']).mean()
Out[17]:
phi sw
formation
codell 18.666667 20.333333
dkot 11.000000 79.000000
fthays 24.000000 19.000000
graneros 17.666667 24.000000
nio 15.600000 40.000000
But I'd like to know if there's a way to add a column for the sample size. So my desired output would be something like:
phi sw n
formation
codell 18.666667 20.333333 3
dkot 11.000000 79.000000 1
fthays 24.000000 19.000000 1
graneros 17.666667 24.000000 3
nio 15.600000 40.000000 5
You can do this with the aggregate method, passing both the mean and a count function as arguments:
>>> import numpy as np
>>> df.groupby(['formation']).aggregate([np.mean, np.size])
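If you want the sample size as a single column named n, matching the desired output above, a minimal sketch (using the df from the question) is to compute the means and then attach the group sizes:
means = df.groupby('formation').mean()       # per-group means of phi and sw
means['n'] = df.groupby('formation').size()  # per-group sample size
print(means)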
Today I'm struggling once again with Python and data analytics.
I have a dataframe which looks like this:
name totdmgdealt
0 Warwick 96980.0
1 Nami 25995.0
2 Draven 171568.0
3 Fiora 113721.0
4 Viktor 185302.0
5 Skarner 148791.0
6 Galio 130692.0
7 Ahri 145731.0
8 Jinx 182680.0
9 VelKoz 85785.0
10 Ziggs 46790.0
11 Cassiopeia 62444.0
12 Yasuo 117896.0
13 Warwick 129156.0
14 Evelynn 179252.0
15 Caitlyn 163342.0
16 Wukong 122919.0
17 Syndra 146754.0
18 Karma 35766.0
19 Warwick 117790.0
20 Draven 74879.0
21 Janna 11242.0
22 Lux 66424.0
23 Amumu 87826.0
24 Vayne 76085.0
25 Ahri 93334.0
..
..
..
This dataframe holds the total damage a champion dealt in one game.
Now I want to group this information so I can see which champion has dealt the most damage overall.
I tried groupby('name'), but it didn't work at all.
I've already gone through some threads about groupby and summing values, but I couldn't solve my specific problem.
The damage dealt by each champion should also be shown as a percentage of the total.
I'm looking for something like this as an output:
name totdmgdealt percentage
0 Warwick 2378798098 2.1 %
1 Nami 2837491074 2.3 %
2 Draven 1231451224 ..
3 Fiora 1287301724 ..
4 Viktor 1239808504 ..
5 Skarner 1487911234 ..
6 Galio 1306921234 ..
We can group by name and take the sum, then divide each value by the total with .div, multiply it by 100 with .mul, and finally round it to one decimal with .round:
total = df['totdmgdealt'].sum()
summed = df.groupby('name', sort=False)['totdmgdealt'].sum().reset_index()
summed['percentage'] = summed['totdmgdealt'].div(total).mul(100).round(1)
name totdmgdealt percentage
0 Warwick 343926.0 12.2
1 Nami 25995.0 0.9
2 Draven 246447.0 8.7
3 Fiora 113721.0 4.0
4 Viktor 185302.0 6.6
5 Skarner 148791.0 5.3
6 Galio 130692.0 4.6
7 Ahri 239065.0 8.5
8 Jinx 182680.0 6.5
9 VelKoz 85785.0 3.0
10 Ziggs 46790.0 1.7
11 Cassiopeia 62444.0 2.2
12 Yasuo 117896.0 4.2
13 Evelynn 179252.0 6.4
14 Caitlyn 163342.0 5.8
15 Wukong 122919.0 4.4
16 Syndra 146754.0 5.2
17 Karma 35766.0 1.3
18 Janna 11242.0 0.4
19 Lux 66424.0 2.4
20 Amumu 87826.0 3.1
21 Vayne 76085.0 2.7
You can use sum() to get the total damage and apply to calculate the percentage relevant to each row, like this:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO("""
name totdmgdealt
0 Warwick 96980.0
1 Nami 25995.0
2 Draven 171568.0
3 Fiora 113721.0
4 Viktor 185302.0
5 Skarner 148791.0
6 Galio 130692.0
7 Ahri 145731.0
8 Jinx 182680.0
9 VelKoz 85785.0
10 Ziggs 46790.0
11 Cassiopeia 62444.0
12 Yasuo 117896.0
13 Warwick 129156.0
14 Evelynn 179252.0
15 Caitlyn 163342.0
16 Wukong 122919.0
17 Syndra 146754.0
18 Karma 35766.0
19 Warwick 117790.0
20 Draven 74879.0
21 Janna 11242.0
22 Lux 66424.0
23 Amumu 87826.0
24 Vayne 76085.0
25 Ahri 93334.0"""), sep=r"\s+")
summed_df = df.groupby('name')['totdmgdealt'].agg(['sum']).rename(columns={"sum": "totdmgdealt"}).reset_index()
summed_df['percentage'] = summed_df.apply(
lambda x: "{:.2f}%".format(x['totdmgdealt'] / summed_df['totdmgdealt'].sum() * 100), axis=1)
print(summed_df)
Output:
name totdmgdealt percentage
0 Ahri 239065.0 8.48%
1 Amumu 87826.0 3.12%
2 Caitlyn 163342.0 5.79%
3 Cassiopeia 62444.0 2.21%
4 Draven 246447.0 8.74%
5 Evelynn 179252.0 6.36%
6 Fiora 113721.0 4.03%
7 Galio 130692.0 4.64%
8 Janna 11242.0 0.40%
9 Jinx 182680.0 6.48%
10 Karma 35766.0 1.27%
11 Lux 66424.0 2.36%
12 Nami 25995.0 0.92%
13 Skarner 148791.0 5.28%
14 Syndra 146754.0 5.21%
15 Vayne 76085.0 2.70%
16 VelKoz 85785.0 3.04%
17 Viktor 185302.0 6.57%
18 Warwick 343926.0 12.20%
19 Wukong 122919.0 4.36%
20 Yasuo 117896.0 4.18%
21 Ziggs 46790.0 1.66%
Maybe you can try this:
I achieved the same result using my own sample data; try running the code below in your Jupyter notebook:
import pandas as pd

name = ['abhit', 'mawa', 'vaibhav', 'dharam', 'sid', 'abhit', 'vaibhav', 'sid', 'mawa', 'lakshya']
totdmgdealt = [24, 45, 80, 22, 89, 55, 89, 51, 93, 85]
name = pd.Series(name, name='name')                       # converting into Series
totdmgdealt = pd.Series(totdmgdealt, name='totdmgdealt')  # converting into Series
data = pd.concat([name, totdmgdealt], axis=1)             # combining into a DataFrame

final = data.pivot_table(values="totdmgdealt", columns="name", aggfunc="sum").transpose()  # actual aggregating step

total = final['totdmgdealt'].sum()  # total damage, for calculating the percentage

def calPer(row, total):  # actual function for the percentage
    return ((row / total) * 100).round(2)

final['Percentage'] = calPer(final['totdmgdealt'], total)  # assigning the result to a new column
final
Sample data:
name totdmgdealt
0 abhit 24
1 mawa 45
2 vaibhav 80
3 dharam 22
4 sid 89
5 abhit 55
6 vaibhav 89
7 sid 51
8 mawa 93
9 lakshya 85
Output:
totdmgdealt Percentage
name
abhit 79 12.48
dharam 22 3.48
lakshya 85 13.43
mawa 138 21.80
sid 140 22.12
vaibhav 169 26.70
Understand and run the code, then just replace the dataset with yours. Maybe this helps.
I have a simple dataframe df that contains three columns:
Time: expressed in seconds
A: set of values that can vary between -inf and +inf
B: set of angles (degrees) which range between 0 and 359
Here is the dataframe
df = pd.DataFrame({'Time':[0,12,23,25,44,50], 'A':[5,7,9,8,11,6], 'B':[300,358,4,10,2,350]})
And it looks like this:
Time A B
0 0 5 300
1 12 7 358
2 23 9 4
3 25 8 10
4 44 11 2
5 50 6 350
My idea is to interpolate the data from 0 to 50 seconds and I was able to achieve my goal using the following lines of code:
y = pd.DataFrame({'Time':list(range(df['Time'].iloc[0], df['Time'].iloc[-1]))})
df = pd.merge(left=y, right=df, on='Time', how='left').interpolate()
Problem: even though column A is interpolated correctly, column B is wrong because the interpolation does not account for the wrap-around at 360 degrees! Here is an example:
Time A B
12 12 7.000000 358.000000
13 13 7.181818 325.818182
14 14 7.363636 293.636364
15 15 7.545455 261.454545
16 16 7.727273 229.272727
17 17 7.909091 197.090909
18 18 8.090909 164.909091
19 19 8.272727 132.727273
20 20 8.454545 100.545455
21 21 8.636364 68.363636
22 22 8.818182 36.181818
23 23 9.000000 4.000000
Question: can you suggest a smart and efficient way to solve this issue and correctly interpolate the angles across the 0/360 degree boundary?
You should be able to use the method described in this question for the angle column:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Time':[0,12,23,25,44,50], 'A':[5,7,9,8,11,6], 'B':[300,358,4,10,2,350]})
df['B'] = np.rad2deg(np.unwrap(np.deg2rad(df['B'])))  # make the angle series continuous
y = pd.DataFrame({'Time': list(range(df['Time'].iloc[0], df['Time'].iloc[-1]))})
df = pd.merge(left=y, right=df, on='Time', how='left').interpolate()
df['B'] %= 360  # wrap the interpolated angles back into [0, 360)
print(df)
Output:
Time A B
0 0 5.000000 300.000000
1 1 5.166667 304.833333
2 2 5.333333 309.666667
3 3 5.500000 314.500000
4 4 5.666667 319.333333
5 5 5.833333 324.166667
6 6 6.000000 329.000000
7 7 6.166667 333.833333
8 8 6.333333 338.666667
9 9 6.500000 343.500000
10 10 6.666667 348.333333
11 11 6.833333 353.166667
12 12 7.000000 358.000000
13 13 7.181818 358.545455
14 14 7.363636 359.090909
15 15 7.545455 359.636364
16 16 7.727273 0.181818
17 17 7.909091 0.727273
18 18 8.090909 1.272727
19 19 8.272727 1.818182
20 20 8.454545 2.363636
21 21 8.636364 2.909091
22 22 8.818182 3.454545
23 23 9.000000 4.000000
24 24 8.500000 7.000000
25 25 8.000000 10.000000
26 26 8.157895 9.578947
27 27 8.315789 9.157895
28 28 8.473684 8.736842
29 29 8.631579 8.315789
30 30 8.789474 7.894737
31 31 8.947368 7.473684
32 32 9.105263 7.052632
33 33 9.263158 6.631579
34 34 9.421053 6.210526
35 35 9.578947 5.789474
36 36 9.736842 5.368421
37 37 9.894737 4.947368
38 38 10.052632 4.526316
39 39 10.210526 4.105263
40 40 10.368421 3.684211
41 41 10.526316 3.263158
42 42 10.684211 2.842105
43 43 10.842105 2.421053
44 44 11.000000 2.000000
45 45 11.000000 2.000000
46 46 11.000000 2.000000
47 47 11.000000 2.000000
48 48 11.000000 2.000000
49 49 11.000000 2.000000
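For intuition, np.unwrap detects jumps larger than half a cycle between consecutive samples and shifts the following values by whole cycles until the series is continuous. A small illustration on the B column:
import numpy as np

angles = np.array([300, 358, 4, 10, 2, 350])
print(np.rad2deg(np.unwrap(np.deg2rad(angles))))
# [300. 358. 364. 370. 362. 350.] -- the artificial 358 -> 4 jump is gone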
I am trying to calculate the average of the last 13 months for each month for P1 and P2. Here is a sample of the data:
P1 P2
Month
May-16 4 24
Jun-16 2 9
Jul-16 4 20
Aug-16 2 12
Sep-16 7 8
Oct-16 7 11
Nov-16 0 4
Dec-16 3 18
Jan-17 4 9
Feb-17 9 16
Mar-17 2 13
Apr-17 9 9
May-17 5 13
Jun-17 9 16
Jul-17 5 11
Aug-17 6 11
Sep-17 8 13
Oct-17 6 12
Nov-17 9 21
Dec-17 4 12
Jan-18 2 12
Feb-18 7 17
Mar-18 5 15
Apr-18 3 13
May-18 7 25
Jun-18 5 23
I am trying to create this table:
P1 P2 AVGP1 AVGP2
Month
Jun-17 9 16 4.85 11.23
Jul-17 5 11 5.08 11.38
Aug-17 6 11 5.23 11.54
Sep-17 8 13 5.69 11.54
Oct-17 6 12 5.62 11.85
Nov-17 9 21 5.77 12.46
Dec-17 4 12 6.08 13.08
Jan-18 2 12 6.00 12.62
Feb-18 7 17 6.23 13.23
Mar-18 5 15 5.92 13.23
Apr-18 3 13 6.00 13.23
May-18 7 25 5.85 14.46
Jun-18 5 23 5.85 15.23
The goal is to create a dataframe with the above table. I can't figure out how to make a function that will calculate only the last 13 months of data. Any help would be great!
You can use pd.DataFrame.rolling followed by dropna:
res = df.join(df.rolling(13).mean().add_prefix('AVG')).dropna(how='any')
print(res)
P1 P2 AVGP1 AVGP2
Month
May-17 5 13 4.461538 12.769231
Jun-17 9 16 4.846154 12.153846
Jul-17 5 11 5.076923 12.307692
Aug-17 6 11 5.230769 11.615385
Sep-17 8 13 5.692308 11.692308
Oct-17 6 12 5.615385 12.000000
Nov-17 9 21 5.769231 12.769231
Dec-17 4 12 6.076923 13.384615
Jan-18 2 12 6.000000 12.923077
Feb-18 7 17 6.230769 13.538462
Mar-18 5 15 5.923077 13.461538
Apr-18 3 13 6.000000 13.461538
May-18 7 25 5.846154 14.692308
Jun-18 5 23 5.846154 15.461538
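Broken into steps, the one-liner does the following (a sketch, assuming df is the monthly frame from the question):
avgs = df.rolling(13).mean().add_prefix('AVG')  # trailing 13-month means, named AVGP1 and AVGP2
res = df.join(avgs)                             # align them on the Month index
res = res.dropna(how='any')                     # drop the first 12 rows, whose windows are incomplete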
I need to convert this kind of Year variable in pandas:
Year Sales
0 1-01 266.0
1 1-02 145.9
2 1-03 183.1
3 1-04 119.3
4 1-05 180.3
5 1-06 168.5
6 1-07 231.8
7 1-08 224.5
8 1-09 192.8
9 1-10 122.9
10 1-11 336.5
11 1-12 185.9
12 2-01 194.3
13 2-02 149.5
14 2-03 210.1
15 2-04 273.3
16 2-05 191.4
17 2-06 287.0
18 2-07 226.0
19 2-08 303.6
20 2-09 289.9
21 2-10 421.6
22 2-11 264.5
23 2-12 342.3
24 3-01 339.7
25 3-02 440.4
26 3-03 315.9
27 3-04 439.3
in order to use it with scikit-learn or other methods to make predictions. I do not know how I should do it, as the date has a nonstandard format.
I used
pd.datetime.strptime(dates, '%m-%d')
but it does not work.
Can you tell me what your approach would be?
Best
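One possible approach (a sketch, not from the original thread): assuming the digits before the dash are a year offset and the digits after are a month, split the column and assemble real dates with pd.to_datetime. The base year of 2000 below is an arbitrary assumption:
import pandas as pd

df = pd.DataFrame({'Year': ['1-01', '1-02', '2-01', '3-04'],
                   'Sales': [266.0, 145.9, 194.3, 439.3]})

parts = df['Year'].str.split('-', expand=True).astype(int)  # split "y-mm" into two integer columns
parts.columns = ['year', 'month']
parts['year'] += 2000   # hypothetical base year
parts['day'] = 1        # pin each period to the first of the month
df['date'] = pd.to_datetime(parts)  # assemble proper timestamps
print(df.dtypes)        # 'date' is datetime64[ns], usable as a model feature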
A B C D E
0 2002-01-13 Dan 2002-01-15 26 -1
1 2002-01-13 Dan 2002-01-15 10 0
2 2002-01-13 Dan 2002-01-15 16 1
3 2002-01-13 Vic 2002-01-17 14 0
4 2002-01-13 Vic 2002-01-03 18 0
5 2002-01-28 Mel 2002-02-08 37 0
6 2002-01-28 Mel 2002-02-06 29 0
7 2002-01-28 Mel 2002-02-10 20 0
8 2002-01-28 Rob 2002-02-12 30 -1
9 2002-01-28 Rob 2002-02-12 48 1
10 2002-01-28 Rob 2002-02-12 0 1
11 2002-01-28 Rob 2002-02-01 19 0
Wen answered a very similar question an hour ago, but I forgot to include some conditions. I'll write them down below:
I want to create a new df['F'] column, with the following conditions, per each B group and ignoring zeros in column D:
F = the D value at the row whose C date is nearest to 10 days after the A date and where E = 0.
If no E = 0 row exists at that nearest date (the case of 2002-01-28 Rob), F will be the mean of the D values where E = -1 and E = 1.
If two C dates are at the same distance from 10 days after A (the case of 2002-01-28 Mel), F will be the mean of those same-period D values.
Output should be:
A B C D E F
0 2002-01-13 Dan 2002-01-15 26 -1 10
1 2002-01-13 Dan 2002-01-15 10 0 10
2 2002-01-13 Dan 2002-01-15 16 1 10
3 2002-01-13 Vic 2002-01-17 14 0 14
4 2002-01-13 Vic 2002-01-03 18 0 14
5 2002-01-28 Mel 2002-02-08 37 0 33
6 2002-01-28 Mel 2002-02-06 29 0 33
7 2002-01-28 Mel 2002-02-10 20 0 33
8 2002-01-28 Rob 2002-02-12 30 -1 39
9 2002-01-28 Rob 2002-02-12 48 1 39
10 2002-01-28 Rob 2002-02-12 0 1 39
11 2002-01-28 Rob 2002-02-01 19 0 39
Wen answered:
df['F'] = abs((df.C - df.A).dt.days - 10)  # distance of each C date from 10 days after A
# per B group, keep the rows at the minimum distance and map the mean of their D values back
df['F'] = df.B.map(df.loc[df.F == df.groupby('B').F.transform('min')].groupby('B').D.mean())
df
But now I can't work out how to incorporate the new conditions listed above.
Change the mapper to
# per B group: keep the rows at the minimum distance, ignoring D == 0;
# use the mean of D where E == 0 if such rows exist, otherwise the mean of all remaining D
m = df.loc[(df.F == df.groupby('B').F.transform('min')) & (df.D != 0)] \
      .groupby('B') \
      .apply(lambda x: x['D'][x['E'] == 0].mean() if (x['E'] == 0).any() else x['D'].mean())
df['F'] = df.B.map(m)
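A quick sanity check of the per-group result (a sketch; it assumes A and C were parsed as datetimes, e.g. with pd.to_datetime, before computing the day differences):
print(df[['B', 'F']].drop_duplicates())
# expected per group: Dan 10, Vic 14, Mel 33, Rob 39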