Pandas: How to calculate the average of a groupby - python

I have a csv file containing a few attributes, one of which is the star rating of different restaurants, etoiles (French for "stars"). Here annee is the year the rating was made.
Note: I don't know how to share a Jupyter notebook table output here; I tried different command lines but the format was always ugly. If someone could help with that.
What I want to do is pretty simple (I think): I want to add a new column that holds the standard deviation of the mean stars per year of a restaurant. So I must first estimate the average star rating per year, then calculate the standard deviation of those values. But I don't really know the pandas syntax that will let me calculate a restaurant's average star rating per year. Any suggestions?
I understand that I need to group the restaurants by year with .groupby('restaurant_id')['annee'] and then take the average of the restaurant's star ratings during that year, but I don't know how to write it.
# does not work
avis['newColumn'] = (
    avis.groupby(['restaurant_id', 'annee'])['etoiles'].mean().std()
)

Here is a potential solution with groupby:
#generating test data
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=36, freq='M')
year = dates.strftime('%Y')
df = pd.DataFrame([np.random.randint(1, 10) for x in range(36)], columns=['Rating'])
df['restaurants'] = ['R_{}'.format(i) for i in range(4)] * 9
df['date'] = dates
df['year'] = year
print(df)
Rating restaurants date year
0 8 R_0 2013-01-31 2013
1 7 R_1 2013-02-28 2013
2 1 R_2 2013-03-31 2013
3 6 R_3 2013-04-30 2013
4 4 R_0 2013-05-31 2013
5 8 R_1 2013-06-30 2013
6 7 R_2 2013-07-31 2013
7 5 R_3 2013-08-31 2013
8 4 R_0 2013-09-30 2013
9 5 R_1 2013-10-31 2013
10 4 R_2 2013-11-30 2013
11 8 R_3 2013-12-31 2013
12 9 R_0 2014-01-31 2014
13 6 R_1 2014-02-28 2014
14 3 R_2 2014-03-31 2014
15 6 R_3 2014-04-30 2014
16 2 R_0 2014-05-31 2014
17 8 R_1 2014-06-30 2014
18 1 R_2 2014-07-31 2014
19 5 R_3 2014-08-31 2014
20 1 R_0 2014-09-30 2014
21 7 R_1 2014-10-31 2014
22 3 R_2 2014-11-30 2014
23 4 R_3 2014-12-31 2014
24 2 R_0 2015-01-31 2015
25 4 R_1 2015-02-28 2015
26 8 R_2 2015-03-31 2015
27 7 R_3 2015-04-30 2015
28 3 R_0 2015-05-31 2015
29 1 R_1 2015-06-30 2015
30 2 R_2 2015-07-31 2015
31 8 R_3 2015-08-31 2015
32 7 R_0 2015-09-30 2015
33 5 R_1 2015-10-31 2015
34 3 R_2 2015-11-30 2015
35 3 R_3 2015-12-31 2015
#df['date'] = pd.to_datetime(df['date']) #more versatile
#df = df.set_index('date') #more versatile
#df.groupby([pd.Grouper(freq='1Y'), 'restaurants'])['Rating'].mean() #more versatile
df = df.groupby(['year', 'restaurants']).agg({'Rating': ['mean', 'std']})
print(df)
Output:
                  Rating
year restaurants    mean       std
2013 R_0 5.333333 2.309401
R_1 6.666667 1.527525
R_2 4.000000 3.000000
R_3 6.333333 1.527525
2014 R_0 4.000000 4.358899
R_1 7.000000 1.000000
R_2 2.333333 1.154701
R_3 5.000000 1.000000
2015 R_0 4.000000 2.645751
R_1 3.333333 2.081666
R_2 4.333333 3.214550
R_3 6.000000 2.645751
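The commented "more versatile" lines above hint at a date-based variant. Assembled, it could look like this (a sketch on the same test data, not part of the original answer; it groups on calendar years instead of the year strings):
#starting again from the generated test frame (before the agg above)
df['date'] = pd.to_datetime(df['date'])
out = (df.set_index('date')
         .groupby([pd.Grouper(freq='1Y'), 'restaurants'])['Rating']
         .agg(['mean', 'std']))
print(out)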
EDIT:
Renaming columns:
df.columns = ['Mean','STD']
df.reset_index(inplace=True)
year restaurants Mean STD
0 2013 R_0 5.333333 2.309401
1 2013 R_1 6.666667 1.527525
2 2013 R_2 4.000000 3.000000
3 2013 R_3 6.333333 1.527525
4 2014 R_0 4.000000 4.358899
5 2014 R_1 7.000000 1.000000
6 2014 R_2 2.333333 1.154701
7 2014 R_3 5.000000 1.000000
8 2015 R_0 4.000000 2.645751
9 2015 R_1 3.333333 2.081666
10 2015 R_2 4.333333 3.214550
11 2015 R_3 6.000000 2.645751

You can calculate the standard deviation of the mean of stars per year with:
df.groupby('annee')['etoiles'].mean().std()
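If you want it per restaurant and attached as a column, as in the question, here is a sketch (assuming the frame is called avis with columns restaurant_id, annee, and etoiles):
# mean rating per restaurant and year
yearly_mean = avis.groupby(['restaurant_id', 'annee'])['etoiles'].mean()
# standard deviation of those yearly means, per restaurant
std_of_means = yearly_mean.groupby('restaurant_id').std()
# broadcast each restaurant's value back onto its rows
avis['newColumn'] = avis['restaurant_id'].map(std_of_means)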
Let me know if it worked.

Related

Converting a pandas DataFrame from monthly into quarterly values

This is my available df; it contains years from 2016 to 2020:
Year Month Bill
-----------------
2016 1 2
2016 2 5
2016 3 10
2016 4 2
2016 5 4
2016 6 9
2016 7 7
2016 8 8
2016 9 9
2016 10 5
2016 11 1
2016 12 3
.
.
.
2020 12 10
Now I want to create 2 new columns in this dataframe, Level and Contribution.
The Level column contains Q1, Q2, Q3, Q4, representing the 4 quarters of the year, and Contribution contains the average value of the Bill column over the 3 months of that quarter of the respective year.
For example, Q1 for 2016 will contain the average of months 1, 2, 3 of Bill in Contribution,
and likewise Q3 for year 2020 will contain the average of months 7, 8, 9 of the 2020 Bill column in the Contribution column. The expected dataframe is given below:
Year Month Bill levels contribution
------------------------------------
2016 1 2 Q1 5.66
2016 2 5 Q1 5.66
2016 3 10 Q1 5.66
2016 4 2 Q2 5
2016 5 4 Q2 5
2016 6 9 Q2 5
2016 7 7 Q3 8
2016 8 8 Q3 8
2016 9 9 Q3 8
2016 10 5 Q4 3
2016 11 1 Q4 3
2016 12 3 Q4 3
.
.
2020 10 2 Q4 6
2020 11 6 Q4 6
2020 12 10 Q4 6
This process is repeated for all 4 quarters of each year.
I am not able to figure out how to do this, as it is something new to me.
You can try:
import math

# months 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4
df['levels'] = 'Q' + df['Month'].div(3).apply(math.ceil).astype(str)
# average Bill within each (Year, quarter), broadcast back to every row
df['contribution'] = df.groupby(['Year', 'levels'])['Bill'].transform('mean')
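The ceiling can equally be done with integer arithmetic, which drops the math import (a small variation, not from the original answer):
# (Month + 2) // 3 maps 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
df['levels'] = 'Q' + ((df['Month'] + 2) // 3).astype(str)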
Pandas actually has a set of datatypes for monthly and quarterly values, pandas.Period.
See this similar question:
How to create a period range with year and week of year in python?
In your case it would look like this:
from datetime import datetime
# First create dates for 1st of each month period
df['Dates'] = [datetime(row['Year'], row['Month'], 1) for i, row in df[['Year', 'Month']].iterrows()]
# Create monthly periods
df['Month Periods'] = df['Dates'].dt.to_period('M')
# Use the new monthly index
df = df.set_index('Month Periods')
# Group by quarters
df_qtrly = df['Bill'].resample('Q').mean()
df_qtrly.index.names = ['Quarters']
print(df_qtrly)
Result:
Quarters
2016Q1 5.666667
2016Q2 5.000000
2016Q3 8.000000
2016Q4 3.000000
Freq: Q-DEC, Name: Bill, dtype: float64
If you want to put these values back into the monthly dataframe you could do this:
df['Quarters'] = df['Dates'].dt.to_period('Q')
df['Contributions'] = df_qtrly.loc[df['Quarters']].values
Year Month Bill Dates Quarters Contributions
Month Periods
2016-01 2016 1 2 2016-01-01 2016Q1 5.666667
2016-02 2016 2 5 2016-02-01 2016Q1 5.666667
2016-03 2016 3 10 2016-03-01 2016Q1 5.666667
2016-04 2016 4 2 2016-04-01 2016Q2 5.000000
2016-05 2016 5 4 2016-05-01 2016Q2 5.000000
2016-06 2016 6 9 2016-06-01 2016Q2 5.000000
2016-07 2016 7 7 2016-07-01 2016Q3 8.000000
2016-08 2016 8 8 2016-08-01 2016Q3 8.000000
2016-09 2016 9 9 2016-09-01 2016Q3 8.000000
2016-10 2016 10 5 2016-10-01 2016Q4 3.000000
2016-11 2016 11 1 2016-11-01 2016Q4 3.000000
2016-12 2016 12 3 2016-12-01 2016Q4 3.000000

How to create a new column on an existing Dataframe based on values of another Dataframe both of which are of different length [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I am trying to create a new column on an existing dataframe based on values of another dataframe.
import pandas as pd

# Define a dataframe containing 2 columns Date-Year and Date-Qtr
data1 = {'Date-Year': [2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 2017],
'Date-Qtr': ['2015Q1', '2015Q2', '2015Q3', '2015Q4', '2016Q1', '2016Q2', '2016Q3', '2016Q4', '2017Q1', '2017Q2']}
dfx = pd.DataFrame(data1)
# Define another dataframe containing 2 columns Date-Year and Interest Rate
data2 = {'Date-Year': [2000, 2015, 2016, 2017, 2018, 2019, 2020, 2021],
'Interest Rate': [0.00, 8.20, 8.20, 7.75, 7.50, 7.50, 6.50, 6.50]}
dfy = pd.DataFrame(data2)
# Add 1 more column to the first dataframe
dfx['Int-rate'] = float(0)
Output for dfx
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 0.0
1 2015 2015Q2 0.0
2 2015 2015Q3 0.0
3 2015 2015Q4 0.0
4 2016 2016Q1 0.0
5 2016 2016Q2 0.0
6 2016 2016Q3 0.0
7 2016 2016Q4 0.0
8 2017 2017Q1 0.0
9 2017 2017Q2 0.0
Output for dfy
Date-Year Interest Rate
0 2000 0.00
1 2015 8.20
2 2016 8.20
3 2017 7.75
4 2018 7.50
5 2019 7.50
6 2020 6.50
7 2021 6.50
Now I need to update the 'Int-rate' column of dfx by picking up the value of 'Interest Rate' from dfy for the corresponding year, which I am achieving with 2 for loops:
# Check the year from dfx - go to dfy - check the interest rate from dfy for that year - modify Int-rate of dfx with this value
for i in range(len(dfx['Date-Year'])):
    for j in range(len(dfy['Date-Year'])):
        if dfx['Date-Year'][i] == dfy['Date-Year'][j]:
            dfx['Int-rate'][i] = dfy['Interest Rate'][j]
and I get the desired output
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
Is there a way I can achieve the same output:
without declaring dfx['Int-rate'] = float(0)? I get a KeyError: 'Int-rate' if I don't declare it;
without the 2 for loops, which I am not very happy with? Is it possible to do it in a better way (like using map or merge or joins)?
I have tried looking through other posts, and the best one I found is here; I tried using map but I could not do it. Any help will be appreciated.
Thanks
You could use replace with a dictionary:
dfx['Int-Rate'] = dfx['Date-Year'].replace(dict(dfy.to_numpy()))
print(dfx)
Output
Date-Year Date-Qtr Int-Rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
Or with a Series as an alternative:
dfx['Int-Rate'] = dfx['Date-Year'].replace(dfy.set_index('Date-Year').squeeze())
You can simply use df.merge:
In [4448]: df = dfx.merge(dfy).rename(columns={'Interest Rate':'Int-rate'})
In [4449]: df
Out[4449]:
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
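Since the question mentions map, a lookup Series indexed by year works as well (a sketch of the same idea):
# map each Date-Year to its Interest Rate via a year-indexed Series
dfx['Int-rate'] = dfx['Date-Year'].map(dfy.set_index('Date-Year')['Interest Rate'])
All of these create the column on assignment, so the dfx['Int-rate'] = float(0) initialization is not needed.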

Pandas Panel Data - Identifying year gap and calculating returns

I am working with a large panel dataset of financial info, but the values are a bit spotty. I am trying to calculate the return between each year of each stock in my panel data. However, because of missing values, firms sometimes have year gaps, which makes df['stock_ret'] = df.groupby(['tic'])['stock_price'].pct_change() impossible to use, as it would be wrong across a gap. The df looks something like this (just an example):
datadate month fyear ticker price
0 31/12/1998 12 1998 AAPL 188.92
1 31/12/1999 12 1999 AAPL 197.44
2 31/12/2002 12 2002 AAPL 268.13
3 31/12/2003 12 2003 AAPL 278.06
4 31/12/2004 12 2004 AAPL 288.35
5 31/12/2005 12 2005 AAPL 312.23
6 31/05/2008 5 2008 TSLA 45.67
7 31/05/2009 5 2009 TSLA 38.29
8 31/05/2010 5 2010 TSLA 42.89
9 31/05/2011 5 2011 TSLA 56.03
10 31/05/2014 5 2014 TSLA 103.45
.. ... .. .. .. ..
What I am looking for is a piece of code that would allow me to detect, for each individual firm, whether there is a gap in the data, and to calculate returns separately for the different series, just like this:
datadate month fyear ticker price return
0 31/12/1998 12 1998 AAPL 188.92 NaN
1 31/12/1999 12 1999 AAPL 197.44 0.0451
2 31/12/2002 12 2002 AAPL 268.13 NaN
3 31/12/2003 12 2003 AAPL 278.06 0.0370
4 31/12/2004 12 2004 AAPL 288.35 0.0370
5 31/12/2005 12 2005 AAPL 312.23 0.0828
6 31/05/2008 5 2008 TSLA 45.67 NaN
7 31/05/2009 5 2009 TSLA 38.29 -0.1616
8 31/05/2010 5 2010 TSLA 42.89 0.1201
9 31/05/2011 5 2011 TSLA 56.03 0.3063
10 31/05/2014 5 2014 TSLA 103.45 NaN
.. ... .. .. .. ..
If you have any other suggestions on how to treat this problem, please feel free to share your knowledge :) I am a bit inexperienced, so I am sure your advice could help!
Thank you in advance, guys!
You can create a mask that tells whether the previous year exists within each ticker, and only fill those rows with the pct change (a groupby shift is used here for the mask; it is equivalent to the original apply-based version but stable across pandas versions):
import numpy as np

df['return'] = np.nan
# True where the previous row within the ticker is exactly one fiscal year earlier
mask = df.groupby('ticker')['fyear'].shift(1) == df['fyear'] - 1
df.loc[mask, 'return'] = df.groupby('ticker')['price'].pct_change()

Rolling Mean with Groupby object in Pandas returns null

I have this initial DataFrame in Pandas
A B C D E
0 23 2015 1 14937 16.25
1 23 2015 1 19054 7.50
2 23 2015 2 14937 16.75
3 23 2015 2 19054 17.25
4 23 2015 3 14937 71.75
5 23 2015 3 19054 15.00
6 23 2015 4 14937 13.00
7 23 2015 4 19054 37.75
8 23 2015 5 14937 4.25
9 23 2015 5 19054 18.25
10 23 2015 6 14937 16.50
11 23 2015 6 19054 1.00
I create a groupby object because I would like to obtain a rolling mean grouped by columns A, B, C, D:
DfGby = Df.groupby(['A','B', 'C','D'])
Then I compute the rolling mean:
DfMean = pd.DataFrame(DfGby.rolling(center=False,window=3)['E'].mean())
But I obtain
E
A B C D
23 2015 1 14937 0 NaN
19054 1 NaN
2 14937 2 NaN
19054 3 NaN
3 14937 4 NaN
19054 5 NaN
4 14937 6 NaN
19054 7 NaN
5 14937 8 NaN
19054 9 NaN
6 14937 10 NaN
19054 11 NaN
What is the problem here?
If I want to obtain this result, how could I do it?
A B C D E
0 23 2015 1 14937 NaN
1 23 2015 2 14937 NaN
2 23 2015 2 14937 16.6
3 23 2015 1 14937 35.1
4 23 2015 2 14937 33.8
5 23 2015 3 14937 29.7
6 23 2015 4 14937 11.3
7 23 2015 4 19054 NaN
8 23 2015 5 19054 NaN
9 23 2015 5 19054 13.3
10 23 2015 6 19054 23.3
11 23 2015 6 19054 23.7
12 23 2015 6 19054 19.0
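The likely problem: every (A, B, C, D) combination identifies exactly one row, so each group has length 1 and a window of 3 can never fill, which is why everything is NaN. If the rolling mean is meant to run within each D series, here is a sketch (the numbers in the expected output above look approximate):
# keep only the keys that actually repeat, so groups are long enough to roll over
DfMean = (Df.groupby(['A', 'B', 'D'])['E']
            .rolling(window=3, center=False)
            .mean()
            .reset_index())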

How to groupby two fields in pandas?

Given the following input, the goal is to group values by hour for each Date with Avg and Sum functions.
A solution for grouping by hour is here, but it does not consider new days.
Date Time F1 F2 F3
21-01-16 8:11 5 2 4
21-01-16 9:25 9 8 2
21-01-16 9:39 7 3 2
21-01-16 9:53 6 5 1
21-01-16 10:07 4 6 7
21-01-16 10:21 7 3 1
21-01-16 10:35 5 6 7
21-01-16 11:49 1 2 1
21-01-16 12:03 3 3 1
22-01-16 9:45 6 5 1
22-01-16 9:20 4 6 7
22-01-16 12:10 7 3 1
Expected output:
Date,Time,SUM F1,SUM F2,SUM F3,AVG F1,AVG F2,AVG F3
21-01-16,8:00,5,2,4,5,2,4
21-01-16,9:00,22,16,5,7.3,5.3,1.6
21-01-16,10:00,16,15,15,5.3,5,5
21-01-16,11:00,1,2,1,1,2,1
21-01-16,12:00,3,3,1,3,3,1
22-01-16,9:00,10,11,8,5,5.5,4
22-01-16,12:00,7,3,1,7,3,1
You can do the parsing of dates during reading of the csv file:
from __future__ import print_function  # make it work with Python 2 and 3
import pandas as pd

df = pd.read_csv('f123_dates.csv', index_col=0, parse_dates=[0, 1],
                 delim_whitespace=True)
print(df.groupby([df.index, df.Time.dt.hour]).agg(['mean','sum']))
Output:
F1 F2 F3
mean sum mean sum mean sum
Date Time
2016-01-21 8 5.000000 5 2.000000 2 4.000000 4
9 7.333333 22 5.333333 16 1.666667 5
10 5.333333 16 5.000000 15 5.000000 15
11 1.000000 1 2.000000 2 1.000000 1
12 3.000000 3 3.000000 3 1.000000 1
2016-01-22 9 5.000000 10 5.500000 11 4.000000 8
12 7.000000 7 3.000000 3 1.000000 1
All the way into csv:
from __future__ import print_function
import pandas as pd

df = pd.read_csv('f123_dates.csv', index_col=0, parse_dates=[0, 1],
                 delim_whitespace=True)
df2 = df.groupby([df.index, df.Time.dt.hour]).agg(['mean','sum'])
df3 = df2.reset_index()
df3.columns = [' '.join(col).strip() for col in df3.columns.values]
print(df3.to_csv(columns=df3.columns, index=False))
Output:
Date,Time,F1 mean,F1 sum,F2 mean,F2 sum,F3 mean,F3 sum
2016-01-21,8,5.0,5,2.0,2,4.0,4
2016-01-21,9,7.333333333333333,22,5.333333333333333,16,1.6666666666666667,5
2016-01-21,10,5.333333333333333,16,5.0,15,5.0,15
2016-01-21,11,1.0,1,2.0,2,1.0,1
2016-01-21,12,3.0,3,3.0,3,1.0,1
2016-01-22,9,5.0,10,5.5,11,4.0,8
2016-01-22,12,7.0,7,3.0,3,1.0,1
You can convert Time to datetime with to_datetime and then use groupby with agg:
print(df)
Date Time F1 F2 F3
0 2016-01-21 8:11 5 2 4
1 2016-01-21 9:25 9 8 2
2 2016-01-21 9:39 7 3 2
3 2016-01-21 9:53 6 5 1
4 2016-01-21 10:07 4 6 7
5 2016-01-21 10:21 7 3 1
6 2016-01-21 10:35 5 6 7
7 2016-01-21 11:49 1 2 1
8 2016-01-21 12:03 3 3 1
9 2016-01-22 9:45 6 5 1
10 2016-01-22 9:20 4 6 7
11 2016-01-22 12:10 7 3 1
df['Time'] = pd.to_datetime(df['Time'], format="%H:%M")
print(df)
Date Time F1 F2 F3
0 2016-01-21 1900-01-01 08:11:00 5 2 4
1 2016-01-21 1900-01-01 09:25:00 9 8 2
2 2016-01-21 1900-01-01 09:39:00 7 3 2
3 2016-01-21 1900-01-01 09:53:00 6 5 1
4 2016-01-21 1900-01-01 10:07:00 4 6 7
5 2016-01-21 1900-01-01 10:21:00 7 3 1
6 2016-01-21 1900-01-01 10:35:00 5 6 7
7 2016-01-21 1900-01-01 11:49:00 1 2 1
8 2016-01-21 1900-01-01 12:03:00 3 3 1
9 2016-01-22 1900-01-01 09:45:00 6 5 1
10 2016-01-22 1900-01-01 09:20:00 4 6 7
11 2016-01-22 1900-01-01 12:10:00 7 3 1
df = df.groupby([df['Date'], df['Time'].dt.hour]).agg(['mean','sum']).reset_index()
print(df)
Date Time F1 F2 F3
mean sum mean sum mean sum
0 2016-01-21 8 5.000000 5 2.000000 2 4.000000 4
1 2016-01-21 9 7.333333 22 5.333333 16 1.666667 5
2 2016-01-21 10 5.333333 16 5.000000 15 5.000000 15
3 2016-01-21 11 1.000000 1 2.000000 2 1.000000 1
4 2016-01-21 12 3.000000 3 3.000000 3 1.000000 1
5 2016-01-22 9 5.000000 10 5.500000 11 4.000000 8
6 2016-01-22 12 7.000000 7 3.000000 3 1.000000 1
And then you can flatten the column names with a list comprehension (the original used df.columns.labels, which no longer exists in current pandas):
df.columns = [' '.join(col).strip() for col in df.columns.values]
print(df)
Date Time F1 mean F1 sum F2 mean F2 sum F3 mean F3 sum
0 2016-01-21 8 5.000000 5 2.000000 2 4.000000 4
1 2016-01-21 9 7.333333 22 5.333333 16 1.666667 5
2 2016-01-21 10 5.333333 16 5.000000 15 5.000000 15
3 2016-01-21 11 1.000000 1 2.000000 2 1.000000 1
4 2016-01-21 12 3.000000 3 3.000000 3 1.000000 1
5 2016-01-22 9 5.000000 10 5.500000 11 4.000000 8
6 2016-01-22 12 7.000000 7 3.000000 3 1.000000 1
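For completeness, on current pandas a more direct route is to combine the two columns into one timestamp and group by date and hour (a sketch, assuming the raw day-month-year Date and H:M Time strings from the question):
import pandas as pd

# parse Date + Time into a single timestamp
ts = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d-%m-%y %H:%M')
# group by calendar date and hour, aggregate every F column
out = df.groupby([ts.dt.date, ts.dt.hour])[['F1', 'F2', 'F3']].agg(['mean', 'sum'])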
