This is my available df; it contains years from 2016 to 2020:
Year Month Bill
-----------------
2016 1 2
2016 2 5
2016 3 10
2016 4 2
2016 5 4
2016 6 9
2016 7 7
2016 8 8
2016 9 9
2016 10 5
2016 11 1
2016 12 3
.
.
.
2020 12 10
Now I want to create 2 new columns in this dataframe, Level and Contribution.
The Level column contains Q1, Q2, Q3, Q4, representing the 4 quarters of the year, and Contribution contains the average of the Bill column over the 3 months of that quarter in the respective year.
For example, Q1 for 2016 will contain the average of the Bill values for months 1, 2, 3 of 2016 in the **Contribution** column,
and likewise Q3 for 2020 will contain the average of months 7, 8, 9 of the 2020 Bill column. The expected dataframe is given below:
Year Month Bill levels contribution
------------------------------------
2016 1 2 Q1 5.66
2016 2 5 Q1 5.66
2016 3 10 Q1 5.66
2016 4 2 Q2 5
2016 5 4 Q2 5
2016 6 9 Q2 5
2016 7 7 Q3 8
2016 8 8 Q3 8
2016 9 9 Q3 8
2016 10 5 Q4 3
2016 11 1 Q4 3
2016 12 3 Q4 3
.
.
2020 10 2 Q4 6
2020 11 6 Q4 6
2020 12 10 Q4 6
This process is repeated across the 4 quarters of every year.
I am not able to figure out how to do this, as it is something new to me.
You can try:
import math

df['levels'] = 'Q' + df['Month'].div(3).apply(math.ceil).astype(str)
df['contribution'] = df.groupby(['Year', 'levels'])['Bill'].transform('mean')
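As a check, here is that snippet run end to end on a one-year sample built from the numbers in the question (a minimal sketch; only 2016 is included):

```python
import math

import pandas as pd

# 2016 sample from the question
df = pd.DataFrame({
    'Year': [2016] * 12,
    'Month': list(range(1, 13)),
    'Bill': [2, 5, 10, 2, 4, 9, 7, 8, 9, 5, 1, 3],
})

# Months 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4
df['levels'] = 'Q' + df['Month'].div(3).apply(math.ceil).astype(str)
# Quarterly mean of Bill, broadcast back to every month of the quarter
df['contribution'] = df.groupby(['Year', 'levels'])['Bill'].transform('mean')
print(df)
```

Q1 comes out as (2 + 5 + 10) / 3 ≈ 5.67 and Q4 as (5 + 1 + 3) / 3 = 3, matching the expected dataframe.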
pandas actually has a datatype for monthly and quarterly values called pandas.Period.
See this similar question:
How to create a period range with year and week of year in python?
In your case it would look like this:
import pandas as pd
from datetime import datetime
# First create dates for 1st of each month period
df['Dates'] = [datetime(row['Year'], row['Month'], 1) for i, row in df[['Year', 'Month']].iterrows()]
# Create monthly periods
df['Month Periods'] = df['Dates'].dt.to_period('M')
# Use the new monthly index
df = df.set_index('Month Periods')
# Group by quarters
df_qtrly = df['Bill'].resample('Q').mean()
df_qtrly.index.names = ['Quarters']
print(df_qtrly)
Result:
Quarters
2016Q1 5.666667
2016Q2 5.000000
2016Q3 8.000000
2016Q4 3.000000
Freq: Q-DEC, Name: Bill, dtype: float64
If you want to put these values back into the monthly dataframe you could do this:
df['Quarters'] = df['Dates'].dt.to_period('Q')
df['Contributions'] = df_qtrly.loc[df['Quarters']].values
Year Month Bill Dates Quarters Contributions
Month Periods
2016-01 2016 1 2 2016-01-01 2016Q1 5.666667
2016-02 2016 2 5 2016-02-01 2016Q1 5.666667
2016-03 2016 3 10 2016-03-01 2016Q1 5.666667
2016-04 2016 4 2 2016-04-01 2016Q2 5.000000
2016-05 2016 5 4 2016-05-01 2016Q2 5.000000
2016-06 2016 6 9 2016-06-01 2016Q2 5.000000
2016-07 2016 7 7 2016-07-01 2016Q3 8.000000
2016-08 2016 8 8 2016-08-01 2016Q3 8.000000
2016-09 2016 9 9 2016-09-01 2016Q3 8.000000
2016-10 2016 10 5 2016-10-01 2016Q4 3.000000
2016-11 2016 11 1 2016-11-01 2016Q4 3.000000
2016-12 2016 12 3 2016-12-01 2016Q4 3.000000
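As a side note, the iterrows step can be skipped: pd.to_datetime accepts a dataframe with year/month/day columns, so the quarter periods can be built directly (a sketch on the first four rows of the question's data):

```python
import pandas as pd

df = pd.DataFrame({'Year': [2016, 2016, 2016, 2016],
                   'Month': [1, 2, 3, 4],
                   'Bill': [2, 5, 10, 2]})

# Build datetimes from the Year/Month columns (day fixed to 1),
# then convert them to quarterly periods
parts = df[['Year', 'Month']].rename(columns={'Year': 'year', 'Month': 'month'})
df['Quarters'] = pd.to_datetime(parts.assign(day=1)).dt.to_period('Q')
print(df['Quarters'].tolist())
```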
This question already has answers here: Pandas Merging 101 (8 answers). Closed 2 years ago.
I am trying to create a new column on an existing dataframe based on values of another dataframe.
# Define a dataframe containing 2 columns Date-Year and Date-Qtr
data1 = {'Date-Year': [2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 2017],
'Date-Qtr': ['2015Q1', '2015Q2', '2015Q3', '2015Q4', '2016Q1', '2016Q2', '2016Q3', '2016Q4', '2017Q1', '2017Q2']}
dfx = pd.DataFrame(data1)
# Define another dataframe containing 2 columns Date-Year and Interest Rate
data2 = {'Date-Year': [2000, 2015, 2016, 2017, 2018, 2019, 2020, 2021],
'Interest Rate': [0.00, 8.20, 8.20, 7.75, 7.50, 7.50, 6.50, 6.50]}
dfy = pd.DataFrame(data2)
# Add 1 more column to the first dataframe
dfx['Int-rate'] = float(0)
Output for dfx
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 0.0
1 2015 2015Q2 0.0
2 2015 2015Q3 0.0
3 2015 2015Q4 0.0
4 2016 2016Q1 0.0
5 2016 2016Q2 0.0
6 2016 2016Q3 0.0
7 2016 2016Q4 0.0
8 2017 2017Q1 0.0
9 2017 2017Q2 0.0
Output for dfy
Date-Year Interest Rate
0 2000 0.00
1 2015 8.20
2 2016 8.20
3 2017 7.75
4 2018 7.50
5 2019 7.50
6 2020 6.50
7 2021 6.50
Now I need to update the 'Int-rate' column of dfx by picking up the 'Interest Rate' value from dfy for the corresponding year, which I am achieving with 2 for loops:
# Check the year in dfx -> look up the interest rate for that year in dfy -> write it into Int-rate of dfx
for i in range(len(dfx['Date-Year'])):
    for j in range(len(dfy['Date-Year'])):
        if dfx['Date-Year'][i] == dfy['Date-Year'][j]:
            dfx['Int-rate'][i] = dfy['Interest Rate'][j]
and I get the desired output
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
Is there a way I can achieve the same output:
without declaring dfx['Int-rate'] = float(0)? I get a KeyError: 'Int-rate' if I don't declare it first;
without the 2 for loops? Is it possible to do this in a better way (e.g. using map, merge or join)?
I have tried looking through other posts, and the best one I found is here; I tried using map but could not get it to work. Any help will be appreciated.
Thanks!
You could use replace with a dictionary:
dfx['Int-Rate'] = dfx['Date-Year'].replace(dict(dfy.to_numpy()))
print(dfx)
Output
Date-Year Date-Qtr Int-Rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
Or with a Series as an alternative:
dfx['Int-Rate'] = dfx['Date-Year'].replace(dfy.set_index('Date-Year').squeeze())
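Since the question explicitly asks about map: the same lookup works with Series.map against a year-indexed Series, with the caveat that years missing from dfy become NaN rather than passing through unchanged (a sketch on a trimmed version of the question's data):

```python
import pandas as pd

dfx = pd.DataFrame({'Date-Year': [2015, 2016, 2017],
                    'Date-Qtr': ['2015Q1', '2016Q1', '2017Q1']})
dfy = pd.DataFrame({'Date-Year': [2015, 2016, 2017],
                    'Interest Rate': [8.20, 8.20, 7.75]})

# Turn dfy into a Series indexed by year, then map each year to its rate
rates = dfy.set_index('Date-Year')['Interest Rate']
dfx['Int-rate'] = dfx['Date-Year'].map(rates)
print(dfx)
```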
You can simply use df.merge:
In [4448]: df = dfx.merge(dfy).rename(columns={'Interest Rate':'Int-rate'})
In [4449]: df
Out[4449]:
Date-Year Date-Qtr Int-rate
0 2015 2015Q1 8.20
1 2015 2015Q2 8.20
2 2015 2015Q3 8.20
3 2015 2015Q4 8.20
4 2016 2016Q1 8.20
5 2016 2016Q2 8.20
6 2016 2016Q3 8.20
7 2016 2016Q4 8.20
8 2017 2017Q1 7.75
9 2017 2017Q2 7.75
I have a csv file containing a few attributes, one of which is the star rating of different restaurants, etoiles (which means star in French). Here annee means the year the rating was made.
Note: I don't know how to share a Jupyter notebook table output here; I tried different commands but the format was always ugly. If someone could help with that.
What I want to do is pretty simple (I think): I want to add a new column that represents the standard deviation of the mean stars per year of a restaurant. So I must estimate the average star rating per year, then calculate the standard deviation of these values. But I don't really know the pandas syntax that will let me calculate the average star rating of a restaurant per year. Any suggestions?
I understand that I need to group the restaurants by year with .groupby('restaurant_id')['annee'] and then take the average of the restaurant's star ratings during that year, but I don't know how to write it.
# does not work
avis['newColumn'] = (
avis.groupby(['restaurant_id', 'annee'])['etoiles'].mean().std()
)
Here is a potential solution with groupby:
# generating test data
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=36, freq='M')
year = dates.strftime('%Y')
df = pd.DataFrame([np.random.randint(1,10) for x in range(36)],columns=['Rating'])
df['restaurants'] = ['R_{}'.format(i) for i in range(4)]*9
df['date'] = dates
df['year'] = year
print(df)
Rating restaurants date year
0 8 R_0 2013-01-31 2013
1 7 R_1 2013-02-28 2013
2 1 R_2 2013-03-31 2013
3 6 R_3 2013-04-30 2013
4 4 R_0 2013-05-31 2013
5 8 R_1 2013-06-30 2013
6 7 R_2 2013-07-31 2013
7 5 R_3 2013-08-31 2013
8 4 R_0 2013-09-30 2013
9 5 R_1 2013-10-31 2013
10 4 R_2 2013-11-30 2013
11 8 R_3 2013-12-31 2013
12 9 R_0 2014-01-31 2014
13 6 R_1 2014-02-28 2014
14 3 R_2 2014-03-31 2014
15 6 R_3 2014-04-30 2014
16 2 R_0 2014-05-31 2014
17 8 R_1 2014-06-30 2014
18 1 R_2 2014-07-31 2014
19 5 R_3 2014-08-31 2014
20 1 R_0 2014-09-30 2014
21 7 R_1 2014-10-31 2014
22 3 R_2 2014-11-30 2014
23 4 R_3 2014-12-31 2014
24 2 R_0 2015-01-31 2015
25 4 R_1 2015-02-28 2015
26 8 R_2 2015-03-31 2015
27 7 R_3 2015-04-30 2015
28 3 R_0 2015-05-31 2015
29 1 R_1 2015-06-30 2015
30 2 R_2 2015-07-31 2015
31 8 R_3 2015-08-31 2015
32 7 R_0 2015-09-30 2015
33 5 R_1 2015-10-31 2015
34 3 R_2 2015-11-30 2015
35 3 R_3 2015-12-31 2015
#df['date'] = pd.to_datetime(df['date']) #more versatile
#df.set_index('dates') #more versatile
#df.groupby([pd.Grouper(freq='1Y'),'restaurants'])['Rating'].mean() #more versatile
df = df.groupby(['year','restaurants']).agg({'Rating':[np.mean,np.std]})
print(df)
Output:
Rating Rating
year restaurants mean std
2013 R_0 5.333333 2.309401
R_1 6.666667 1.527525
R_2 4.000000 3.000000
R_3 6.333333 1.527525
2014 R_0 4.000000 4.358899
R_1 7.000000 1.000000
R_2 2.333333 1.154701
R_3 5.000000 1.000000
2015 R_0 4.000000 2.645751
R_1 3.333333 2.081666
R_2 4.333333 3.214550
R_3 6.000000 2.645751
EDIT:
Renaming columns:
df.columns = ['Mean','STD']
df.reset_index(inplace=True)
year restaurant Mean STD
0 2013 R_0 1.333333 0.577350
1 2013 R_1 5.333333 3.511885
2 2013 R_2 1.333333 0.577350
3 2013 R_3 4.333333 2.886751
4 2014 R_0 3.000000 1.000000
5 2014 R_1 3.666667 2.886751
6 2014 R_2 4.333333 4.041452
7 2014 R_3 5.333333 2.081666
8 2015 R_0 6.000000 2.645751
9 2015 R_1 6.333333 3.785939
10 2015 R_2 6.333333 3.785939
11 2015 R_3 5.666667 3.055050
You can calculate the standard deviation of the mean stars per year with:
df.groupby('annee')['etoiles'].mean().std()
Let me know if it worked.
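Note that the line above yields a single number across all restaurants. To get one value per restaurant, as the question asks, the yearly means can be grouped again by restaurant before taking the std, and the result mapped back onto the rows (a sketch with made-up data; the avis/restaurant_id/annee/etoiles names follow the question):

```python
import pandas as pd

# Hypothetical ratings: two restaurants, two years each
avis = pd.DataFrame({
    'restaurant_id': ['A', 'A', 'A', 'A', 'B', 'B'],
    'annee':         [2013, 2013, 2014, 2014, 2013, 2014],
    'etoiles':       [4, 2, 5, 5, 1, 5],
})

# Mean stars per restaurant per year...
yearly = avis.groupby(['restaurant_id', 'annee'])['etoiles'].mean()
# ...then the std of those yearly means, per restaurant
std_of_means = yearly.groupby('restaurant_id').std()
avis['newColumn'] = avis['restaurant_id'].map(std_of_means)
print(std_of_means)
```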
I am working with a large panel dataset of financial info; however, the values are a bit spotty. I am trying to calculate the year-over-year return for each stock in my panel data. However, because of missing values firms sometimes have year gaps, so df['stock_ret'] = df.groupby(['tic'])['stock_price'].pct_change() cannot be used as-is, since it would compute returns across those gaps. The df looks something like this (just an example):
datadate month fyear ticker price
0 31/12/1998 12 1998 AAPL 188.92
1 31/12/1999 12 1999 AAPL 197.44
2 31/12/2002 12 2002 AAPL 268.13
3 31/12/2003 12 2003 AAPL 278.06
4 31/12/2004 12 2004 AAPL 288.35
5 31/12/2005 12 2005 AAPL 312.23
6 31/05/2008 5 2008 TSLA 45.67
7 31/05/2009 5 2009 TSLA 38.29
8 31/05/2010 5 2010 TSLA 42.89
9 31/05/2011 5 2011 TSLA 56.03
10 31/05/2014 5 2014 TSLA 103.45
.. ... .. .. .. ..
What I am looking for is a piece of code that would, for each individual firm, detect any gap in the data and calculate returns only within consecutive years, like this:
datadate month fyear ticker price return
0 31/12/1998 12 1998 AAPL 188.92 NaN
1 31/12/1999 12 1999 AAPL 197.44 0.0451
2 31/12/2002 12 2002 AAPL 268.13 NaN
3 31/12/2003 12 2003 AAPL 278.06 0.0370
4 31/12/2004 12 2004 AAPL 288.35 0.0370
5 31/12/2005 12 2005 AAPL 312.23 0.0828
6 31/05/2008 5 2008 TSLA 45.67 NaN
7 31/05/2009 5 2009 TSLA 38.29 -0.1616
8 31/05/2010 5 2010 TSLA 42.89 0.1201
9 31/05/2011 5 2011 TSLA 56.03 0.3063
10 31/05/2014 5 2014 TSLA 103.45 NaN
.. ... .. .. .. ..
If you have any other suggestions on how to treat this problem, please feel free to share your knowledge :) I am a bit inexperienced, so I am sure your advice will help!
Thank you in advance!
You can create a mask that tells whether the previous year exists within each ticker and only update those rows with the pct change:
df['return'] = np.nan
mask = df.groupby('ticker')['fyear'].apply(lambda x: x.shift(1)==x-1)
df.loc[mask,'return'] = df.groupby('ticker')['price'].pct_change()
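A quick check of this masking idea on a reduced version of the question's data; the mask here is built with groupby().shift() so it stays aligned with the original index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'fyear':  [1998, 1999, 2002, 2003, 2008, 2009],
    'ticker': ['AAPL', 'AAPL', 'AAPL', 'AAPL', 'TSLA', 'TSLA'],
    'price':  [188.92, 197.44, 268.13, 278.06, 45.67, 38.29],
})

df['return'] = np.nan
# True only where the previous row within the ticker is the prior fiscal year
mask = df['fyear'].sub(df.groupby('ticker')['fyear'].shift(1)).eq(1)
df.loc[mask, 'return'] = df.groupby('ticker')['price'].pct_change()
print(df)
```

Rows that follow a gap (e.g. 2002 after 1999) keep NaN, while consecutive years get the ordinary percentage change.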
I have a dataframe with information about sales of some products (unit):
unit year month price
0 1 2018 6 100
1 1 2013 4 70
2 2 2015 10 80
3 2 2015 2 110
4 3 2017 4 120
5 3 2002 6 90
6 4 2016 1 55
and I would like to add, for each sale, columns with information about the previous sale, or NaN if there is no previous sale.
unit year month price prev_price prev_year prev_month
0 1 2018 6 100 70.0 2013.0 4.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 110.0 2015.0 2.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 90.0 2002.0 6.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
For the moment I am grouping on the unit, keeping the units that have several rows, then extracting the information associated with the minimal date for those units. I then join this table with my original table, keeping only the rows whose dates differ between the two merged tables.
I feel like there is a much simpler way to do this, but I am not sure how.
Use DataFrameGroupBy.shift with add_prefix, then join the result back to the original:
#if real data are not sorted
#df = df.sort_values(['unit','year','month'], ascending=[True, False, False])
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print (df)
unit year month price prev_year prev_month prev_price
0 1 2018 6 100 2013.0 4.0 70.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 2015.0 2.0 110.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 2002.0 6.0 90.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
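For completeness, the same pattern as a self-contained snippet on the first four rows of the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'unit':  [1, 1, 2, 2],
    'year':  [2018, 2013, 2015, 2015],
    'month': [6, 4, 10, 2],
    'price': [100, 70, 80, 110],
})

# shift(-1) pulls the *next* row within each unit, which is the previous
# sale given the newest-first ordering; add_prefix renames the new columns
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print(df)
```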
I want to create 2 running-total columns that ONLY aggregate the Amount values based on whether TYPE is ANNUAL or MONTHLY within each Deal.
So it would be DF.groupby(['Deal', 'Booking Month']), then somehow apply a sum when TYPE == 'ANNUAL' for the first column and TYPE == 'MONTHLY' for the second column.
This is what my grouped DF looks like, plus the two desired columns:
Deal TYPE Month Amount Running Total(ANNUAL) Running Total(Monthly)
A ANNUAL April 1000 1000 0
A ANNUAL April 2000 3000 0
A MONTHLY June 1500 3000 1500
B MONTHLY April 11150 0 11150
B ANNUAL July 700 700 11150
B ANNUAL August 303.63 1003.63 11150
C ANNUAL April 25624.59 25624.59 0
D ANNUAL June 5000 5000 0
D ANNUAL July 5000 10000 0
D ANNUAL August 5000 15000 0
E ANNUAL April 10 10 0
E MONTHLY May 1000 10 1000
E ANNUAL May 500 510 1000
E MONTHLY June 500.00 510 1500
E ANNUAL June 600 1110 1500
E MONTHLY July 300 1110 1800
E MONTHLY July 8200 1110 10000
Use filters and groupby + transform:
mask = df.TYPE.eq('ANNUAL')
cols = ['Running Total(ANNUAL)','Running Total(MONTHLY)']
df.loc[mask,'Running Total(ANNUAL)'] = df.loc[mask,'Amount']
df.loc[~mask,'Running Total(MONTHLY)'] = df.loc[~mask,'Amount']
df[cols] = df[cols].fillna(0)
df[cols] = df.groupby('Deal')[cols].transform('cumsum')
print(df)
Deal TYPE Month Amount Running Total(ANNUAL) \
0 A ANNUAL April 1000.00 1000.00
1 A ANNUAL April 2000.00 3000.00
2 A MONTHLY June 1500.00 3000.00
3 B MONTHLY April 11150.00 0.00
4 B ANNUAL July 700.00 700.00
5 B ANNUAL August 303.63 1003.63
6 C ANNUAL April 25624.59 25624.59
7 D ANNUAL June 5000.00 5000.00
8 D ANNUAL July 5000.00 10000.00
9 D ANNUAL August 5000.00 15000.00
10 E ANNUAL April 10.00 10.00
11 E MONTHLY May 1000.00 10.00
12 E ANNUAL May 500.00 510.00
13 E MONTHLY June 500.00 510.00
14 E ANNUAL June 600.00 1110.00
15 E MONTHLY July 300.00 1110.00
16 E MONTHLY July 8200.00 1110.00
Running Total(MONTHLY)
0 0.0
1 0.0
2 1500.0
3 11150.0
4 11150.0
5 11150.0
6 0.0
7 0.0
8 0.0
9 0.0
10 0.0
11 1000.0
12 1000.0
13 1500.0
14 1500.0
15 1800.0
16 10000.0
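The intermediate fillna step can also be avoided by zeroing out the other type with Series.where before the grouped cumulative sum (a sketch on the first four rows, column names as above):

```python
import pandas as pd

df = pd.DataFrame({
    'Deal':   ['A', 'A', 'A', 'B'],
    'TYPE':   ['ANNUAL', 'ANNUAL', 'MONTHLY', 'MONTHLY'],
    'Amount': [1000.0, 2000.0, 1500.0, 11150.0],
})

mask = df['TYPE'].eq('ANNUAL')
# Replace the other type's amounts with 0, then take a running sum per Deal
df['Running Total(ANNUAL)'] = df['Amount'].where(mask, 0).groupby(df['Deal']).cumsum()
df['Running Total(MONTHLY)'] = df['Amount'].where(~mask, 0).groupby(df['Deal']).cumsum()
print(df)
```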
You can do this with .expanding().sum(), which maintains a MultiIndex of the groups that you can unstack to get a separate column for each type. Use another groupby to forward-fill the missing values within each group, then concatenate the result back.
The nice thing about this approach is that it works for arbitrarily many types, without needing to list them explicitly anywhere.
import pandas as pd
df2 = (df.groupby(['Deal', 'TYPE'])
.Amount.expanding().sum()
.unstack(level=1)
.groupby(level=0)
.ffill().fillna(0)
.reset_index(level=0, drop=True)
.drop(columns='Deal'))
pd.concat([df, df2], axis=1)
Output
Deal TYPE Month Amount ANNUAL MONTHLY
0 A ANNUAL April 1000.00 1000.00 0.0
1 A ANNUAL April 2000.00 3000.00 0.0
2 A MONTHLY June 1500.00 3000.00 1500.0
3 B MONTHLY April 11150.00 0.00 11150.0
4 B ANNUAL July 700.00 700.00 11150.0
5 B ANNUAL August 303.63 1003.63 11150.0
6 C ANNUAL April 25624.59 25624.59 0.0
7 D ANNUAL June 5000.00 5000.00 0.0
8 D ANNUAL July 5000.00 10000.00 0.0
9 D ANNUAL August 5000.00 15000.00 0.0
10 E ANNUAL April 10.00 10.00 0.0
11 E MONTHLY May 1000.00 10.00 1000.0
12 E ANNUAL May 500.00 510.00 1000.0
13 E MONTHLY June 500.00 510.00 1500.0
14 E ANNUAL June 600.00 1110.00 1500.0
15 E MONTHLY July 300.00 1110.00 1800.0
16 E MONTHLY July 8200.00 1110.00 10000.0