Python Groupby Running Total/Cumsum column based on string in another column

I want to create 2 running-total columns that ONLY aggregate the Amount values, based on whether TYPE is ANNUAL or MONTHLY, within each Deal.
So it would be DF.groupby(['Deal','Booking Month']) and then somehow apply a sum when TYPE==ANNUAL for the first column and when TYPE==MONTHLY for the second column.
This is what my grouped DF looks like, plus the two desired columns:
Deal TYPE Month Amount Running Total(ANNUAL) Running Total(Monthly)
A ANNUAL April 1000 1000 0
A ANNUAL April 2000 3000 0
A MONTHLY June 1500 3000 1500
B MONTHLY April 11150 0 11150
B ANNUAL July 700 700 11150
B ANNUAL August 303.63 1003.63 11150
C ANNUAL April 25624.59 25624.59 0
D ANNUAL June 5000 5000 0
D ANNUAL July 5000 10000 0
D ANNUAL August 5000 15000 0
E ANNUAL April 10 10 0
E MONTHLY May 1000 10 1000
E ANNUAL May 500 510 1000
E MONTHLY June 500.00 510 1500
E ANNUAL June 600 1110 1500
E MONTHLY July 300 1110 1800
E MONTHLY July 8200 1110 10000

Use filters and groupby + transform:
mask = df.TYPE.eq('ANNUAL')
cols = ['Running Total(ANNUAL)', 'Running Total(MONTHLY)']
# seed each running-total column with Amount for rows of the matching TYPE
df.loc[mask, 'Running Total(ANNUAL)'] = df.loc[mask, 'Amount']
df.loc[~mask, 'Running Total(MONTHLY)'] = df.loc[~mask, 'Amount']
df[cols] = df[cols].fillna(0)
# cumulative sum within each Deal; index with a list of columns
# (the old tuple form df.groupby(...)['a','b'] was removed in pandas 2.0)
df[cols] = df.groupby('Deal')[cols].transform('cumsum')
print(df)
Deal TYPE Month Amount Running Total(ANNUAL) \
0 A ANNUAL April 1000.00 1000.00
1 A ANNUAL April 2000.00 3000.00
2 A MONTHLY June 1500.00 3000.00
3 B MONTHLY April 11150.00 0.00
4 B ANNUAL July 700.00 700.00
5 B ANNUAL August 303.63 1003.63
6 C ANNUAL April 25624.59 25624.59
7 D ANNUAL June 5000.00 5000.00
8 D ANNUAL July 5000.00 10000.00
9 D ANNUAL August 5000.00 15000.00
10 E ANNUAL April 10.00 10.00
11 E MONTHLY May 1000.00 10.00
12 E ANNUAL May 500.00 510.00
13 E MONTHLY June 500.00 510.00
14 E ANNUAL June 600.00 1110.00
15 E MONTHLY July 300.00 1110.00
16 E MONTHLY July 8200.00 1110.00
Running Total(MONTHLY)
0 0.0
1 0.0
2 1500.0
3 11150.0
4 11150.0
5 11150.0
6 0.0
7 0.0
8 0.0
9 0.0
10 0.0
11 1000.0
12 1000.0
13 1500.0
14 1500.0
15 1800.0
16 10000.0
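The same idea also fits in two lines by masking Amount with where before the grouped cumsum, which avoids the intermediate fillna. A minimal sketch on a toy frame with the same columns:
import pandas as pd

df = pd.DataFrame({'Deal': ['A', 'A', 'A', 'B'],
                   'TYPE': ['ANNUAL', 'ANNUAL', 'MONTHLY', 'MONTHLY'],
                   'Amount': [1000, 2000, 1500, 11150]})
mask = df['TYPE'].eq('ANNUAL')
# zero out rows of the other type, then take the running sum within each Deal
df['Running Total(ANNUAL)'] = df['Amount'].where(mask, 0).groupby(df['Deal']).cumsum()
df['Running Total(MONTHLY)'] = df['Amount'].where(~mask, 0).groupby(df['Deal']).cumsum()
print(df)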

You can do this with .expanding().sum(), which will maintain a MultiIndex of the groups that you can unstack to get a separate column for each type. Use another groupby to fill the missing values within each group accordingly, then concatenate the result back.
The nice thing about this is that it works for arbitrarily many types, without needing to define them anywhere explicitly; see the sketch after the output below.
import pandas as pd
df2 = (df.groupby(['Deal', 'TYPE'])
         .Amount.expanding().sum()          # running total per (Deal, TYPE)
         .unstack(level=1)                  # one column per TYPE
         .groupby(level=0)                  # within each Deal...
         .ffill().fillna(0)                 # ...carry the last total forward
         .reset_index(level=0, drop=True))  # back to the original row index
pd.concat([df, df2], axis=1)
Output
Deal TYPE Month Amount ANNUAL MONTHLY
0 A ANNUAL April 1000.00 1000.00 0.0
1 A ANNUAL April 2000.00 3000.00 0.0
2 A MONTHLY June 1500.00 3000.00 1500.0
3 B MONTHLY April 11150.00 0.00 11150.0
4 B ANNUAL July 700.00 700.00 11150.0
5 B ANNUAL August 303.63 1003.63 11150.0
6 C ANNUAL April 25624.59 25624.59 0.0
7 D ANNUAL June 5000.00 5000.00 0.0
8 D ANNUAL July 5000.00 10000.00 0.0
9 D ANNUAL August 5000.00 15000.00 0.0
10 E ANNUAL April 10.00 10.00 0.0
11 E MONTHLY May 1000.00 10.00 1000.0
12 E ANNUAL May 500.00 510.00 1000.0
13 E MONTHLY June 500.00 510.00 1500.0
14 E ANNUAL June 600.00 1110.00 1500.0
15 E MONTHLY July 300.00 1110.00 1800.0
16 E MONTHLY July 8200.00 1110.00 10000.0
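To illustrate the "arbitrarily many types" point, a self-contained sketch with a third, made-up WEEKLY type (hypothetical data): the pipeline picks the new column up automatically.
import pandas as pd

df = pd.DataFrame({'Deal': ['A', 'A', 'A'],
                   'TYPE': ['ANNUAL', 'MONTHLY', 'WEEKLY'],
                   'Amount': [100, 50, 10]})
df2 = (df.groupby(['Deal', 'TYPE'])
         .Amount.expanding().sum()
         .unstack(level=1)
         .groupby(level=0)
         .ffill().fillna(0)
         .reset_index(level=0, drop=True))
print(pd.concat([df, df2], axis=1))  # ANNUAL, MONTHLY and WEEKLY columns appear automatically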

Related

Converting a pandas DataFrame of monthly values into quarterly values

This is my available df; it contains years from 2016 to 2020:
Year Month Bill
-----------------
2016 1 2
2016 2 5
2016 3 10
2016 4 2
2016 5 4
2016 6 9
2016 7 7
2016 8 8
2016 9 9
2016 10 5
2016 11 1
2016 12 3
.
.
.
2020 12 10
Now I want to create 2 new columns in this DataFrame, Level and Contribution.
The Level column contains Q1, Q2, Q3, Q4, representing the 4 quarters of the year, and Contribution contains the average of the Bill column over the 3 months of each quarter of the respective year.
For example, Q1 for 2016 will contain the average of the Bill values for months 1, 2, 3 in the Contribution column,
and likewise Q3 for 2020 will contain the average of months 7, 8, 9 of the 2020 Bill column in the Contribution column. The expected DataFrame is given below:
Year Month Bill levels contribution
------------------------------------
2016 1 2 Q1 5.66
2016 2 5 Q1 5.66
2016 3 10 Q1 5.66
2016 4 2 Q2 5
2016 5 4 Q2 5
2016 6 9 Q2 5
2016 7 7 Q3 8
2016 8 8 Q3 8
2016 9 9 Q3 8
2016 10 5 Q4 3
2016 11 1 Q4 3
2016 12 3 Q4 3
.
.
2020 10 2 Q4 6
2020 11 6 Q4 6
2020 12 10 Q4 6
This process is repeated for all 4 quarters of each year.
I am not able to figure this out, as it is something new to me.
You can try:
import math

# months 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4
df['levels'] = 'Q' + df['Month'].div(3).apply(math.ceil).astype(str)
df['contribution'] = df.groupby(['Year', 'levels'])['Bill'].transform('mean')
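Equivalently, a sketch using only integer arithmetic, which avoids the math import (assuming Month holds integers 1-12):
import pandas as pd

df = pd.DataFrame({'Year': [2016] * 6,
                   'Month': [1, 2, 3, 4, 5, 6],
                   'Bill': [2, 5, 10, 2, 4, 9]})
# (Month - 1) // 3 maps 1-3 -> 0, 4-6 -> 1, ...; add 1 for the quarter number
df['levels'] = 'Q' + ((df['Month'] - 1) // 3 + 1).astype(str)
df['contribution'] = df.groupby(['Year', 'levels'])['Bill'].transform('mean')
print(df)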
Pandas actually has a set of datatypes for monthly and quarterly values, pandas.Period.
See this similar question:
How to create a period range with year and week of year in python?
In your case it would look like this:
from datetime import datetime
# First create dates for 1st of each month period
df['Dates'] = [datetime(row['Year'], row['Month'], 1) for i, row in df[['Year', 'Month']].iterrows()]
# Create monthly periods
df['Month Periods'] = df['Dates'].dt.to_period('M')
# Use the new monthly index
df = df.set_index('Month Periods')
# Group by quarters
df_qtrly = df['Bill'].resample('Q').mean()
df_qtrly.index.names = ['Quarters']
print(df_qtrly)
Result:
Quarters
2016Q1 5.666667
2016Q2 5.000000
2016Q3 8.000000
2016Q4 3.000000
Freq: Q-DEC, Name: Bill, dtype: float64
If you want to put these values back into the monthly dataframe you could do this:
df['Quarters'] = df['Dates'].dt.to_period('Q')
df['Contributions'] = df_qtrly.loc[df['Quarters']].values
Year Month Bill Dates Quarters Contributions
Month Periods
2016-01 2016 1 2 2016-01-01 2016Q1 5.666667
2016-02 2016 2 5 2016-02-01 2016Q1 5.666667
2016-03 2016 3 10 2016-03-01 2016Q1 5.666667
2016-04 2016 4 2 2016-04-01 2016Q2 5.000000
2016-05 2016 5 4 2016-05-01 2016Q2 5.000000
2016-06 2016 6 9 2016-06-01 2016Q2 5.000000
2016-07 2016 7 7 2016-07-01 2016Q3 8.000000
2016-08 2016 8 8 2016-08-01 2016Q3 8.000000
2016-09 2016 9 9 2016-09-01 2016Q3 8.000000
2016-10 2016 10 5 2016-10-01 2016Q4 3.000000
2016-11 2016 11 1 2016-11-01 2016Q4 3.000000
2016-12 2016 12 3 2016-12-01 2016Q4 3.000000
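As an aside, the date construction can be vectorized instead of iterating with iterrows. A sketch, assuming integer Year and Month columns as in the question:
import pandas as pd

# pd.to_datetime assembles dates from year/month/day columns; lower-case the names first
df['Dates'] = pd.to_datetime(df[['Year', 'Month']].rename(columns=str.lower).assign(day=1))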

Pandas Panel Data - Identifying year gap and calculating returns

I am working with a large panel dataset of financial info; however, the values are a bit spotty. I am trying to calculate the return between each year of each stock in my panel data. However, because of missing values, firms sometimes have year gaps, which makes df['stock_ret'] = df.groupby(['tic'])['stock_price'].pct_change() unusable, since it would compute returns across the gaps. The df looks something like this (just giving an example):
datadate month fyear ticker price
0 31/12/1998 12 1998 AAPL 188.92
1 31/12/1999 12 1999 AAPL 197.44
2 31/12/2002 12 2002 AAPL 268.13
3 31/12/2003 12 2003 AAPL 278.06
4 31/12/2004 12 2004 AAPL 288.35
5 31/12/2005 12 2005 AAPL 312.23
6 31/05/2008 5 2008 TSLA 45.67
7 31/05/2009 5 2009 TSLA 38.29
8 31/05/2010 5 2010 TSLA 42.89
9 31/05/2011 5 2011 TSLA 56.03
10 31/05/2014 5 2014 TSLA 103.45
.. ... .. .. .. ..
What I am looking for is a piece of code that would detect (for each individual firm) any gap in the data, and calculate returns only between consecutive years, like this:
datadate month fyear ticker price return
0 31/12/1998 12 1998 AAPL 188.92 NaN
1 31/12/1999 12 1999 AAPL 197.44 0.0451
2 31/12/2002 12 2002 AAPL 268.13 NaN
3 31/12/2003 12 2003 AAPL 278.06 0.0370
4 31/12/2004 12 2004 AAPL 288.35 0.0370
5 31/12/2005 12 2005 AAPL 312.23 0.0828
6 31/05/2008 5 2008 TSLA 45.67 NaN
7 31/05/2009 5 2009 TSLA 38.29 -0.1616
8 31/05/2010 5 2010 TSLA 42.89 0.1201
9 31/05/2011 5 2011 TSLA 56.03 0.3063
10 31/05/2014 5 2014 TSLA 103.45 NaN
.. ... .. .. .. ..
If you have any other suggestions on how to treat this problem, please feel free to share your knowledge :) I am a bit inexperienced so I am sure that your advice could help!
Thank you in advance guys!
You can create a mask that tells whether the previous fiscal year exists, and fill only those rows with the percentage change:
import numpy as np

df['return'] = np.nan
# True where the previous row within the ticker is exactly one fiscal year earlier
# (diff stays aligned with df's index, so the boolean mask is safe to use in .loc)
mask = df.groupby('ticker')['fyear'].diff().eq(1)
df.loc[mask, 'return'] = df.groupby('ticker')['price'].pct_change()
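A quick sanity check on a toy version of the AAPL rows (a sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'fyear': [1998, 1999, 2002, 2003],
                   'ticker': ['AAPL'] * 4,
                   'price': [188.92, 197.44, 268.13, 278.06]})
df['return'] = np.nan
mask = df.groupby('ticker')['fyear'].diff().eq(1)
df.loc[mask, 'return'] = df.groupby('ticker')['price'].pct_change()
print(df['return'].round(4).tolist())  # [nan, 0.0451, nan, 0.037] -- NaN at the 1999->2002 gap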

I want to take the average price of each unique value in each month

I am using a calendar data set for price prediction for different houses, with a date feature that covers 365 days of the year. I would like to reduce the data set by taking the average monthly price of each listing in a new column.
input data:
listing_id date price months
1 2020-01-08 75.0 Jan
1 2020-01-09 100.0 Jan
1 2020-02-08 350.0 Feb
2 2020-01-08 465.0 Jan
2 2020-02-08 250.0 Feb
2 2020-02-09 250.0 Feb
Output data:
listing_id date Avg_price months
1 2020-01-08 90.0 Jan
1 2020-02-08 100.0 Feb
2 2020-01-08 50.0 Jan
2 2020-02-08 150.0 Feb
You can get the average price for each month using groupby:
g = df.groupby("months")["price"].mean()
You can then create new columns:
for month, avg in g.items():  # Series.items(); iteritems() was removed in pandas 2.0
    df["average_{}".format(month)] = avg
Example with dummy data:
import pandas as pd
df = pd.DataFrame({'months':['Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Mar'],
'price':[1, 2, 3, 4, 5, 6]})
Result:
months price average_Feb average_Jan average_Mar
0 Jan 1 2.5 1.0 5.0
1 Feb 2 2.5 1.0 5.0
2 Feb 3 2.5 1.0 5.0
3 Mar 4 2.5 1.0 5.0
4 Mar 5 2.5 1.0 5.0
5 Mar 6 2.5 1.0 5.0
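If you want the per-listing monthly average shown in the question (a single Avg_price column rather than one column per month), a transform-based sketch, ignoring the date column for brevity:
import pandas as pd

df = pd.DataFrame({'listing_id': [1, 1, 1, 2, 2, 2],
                   'months': ['Jan', 'Jan', 'Feb', 'Jan', 'Feb', 'Feb'],
                   'price': [75.0, 100.0, 350.0, 465.0, 250.0, 250.0]})
# broadcast the group mean back to every row...
df['Avg_price'] = df.groupby(['listing_id', 'months'])['price'].transform('mean')
# ...then keep one row per listing-month
out = df.drop(columns='price').drop_duplicates(['listing_id', 'months'])
print(out)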
I upvoted Dan's answer.
It may help to see another way to do this.
Additionally, if you ever have data that spans multiple years you may want a month_year column instead.
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html
Example:
df = pd.DataFrame({'price': [i for i in range(121)]},
                  index=pd.date_range(start='12/1/2017', end='3/31/2018'))
df = df.reset_index()
# parenthesize the expression so it can span two lines
df['month_year'] = (df['index'].dt.month_name() + " "
                    + df['index'].dt.year.astype(str))
df.pivot_table(values='price', columns='month_year')
Result:
In [39]: df.pivot_table(values='price',columns='month_year')
Out[39]:
month_year December 2017 February 2018 January 2018 March 2018
price 15.0 75.5 46.0 105.0

Python Pandas Create USD_Converted column

I have two dataframes, deals
Currency Deal_Amount
0 USD 18.40
1 USD 18.40
2 USD 5559.00
3 USD 14300.00
4 USD 1000.00
5 EUR 3072.00
6 USD 500.00
7 CAD 100000.00
8 USD 250.00
15 EUR 6000.00
and currency_rates
currency_code year quarter from_usd_rate to_usd_rate
AED 2018 3 3.67285 0.27226813
ARS 2018 3 17.585 0.056866648
AUD 2018 3 1.27186 0.786250059
BRL 2018 3 3.1932 0.313165477
CAD 2018 3 1.2368 0.808538163
EUR 2018 3 0.852406 1.173149884
GBP 2018 3 0.747077 1.338550109
GHS 2018 3 4.4 0.227272727
I want to create a column in deals that converts the rows where deals['Currency'] != 'USD', applying currency_rates['to_usd_rate'] to deals['Deal_Amount'] to get the USD-converted amount.
So far I tried:
def convert_amount(data):
    if data['Currency'] == currency_rates['currency_code']:
        Converted_amount = data['Deal_Amount'] * currency_rates['to_usd_rate']
        return Converted_amount
but it's not working.
You can merge, then fillna with 1 so currencies with no rate row (USD) keep their amount:
A = df1.merge(df2, left_on='Currency', right_on='currency_code', how='left').fillna(1)
# the merge result has a fresh RangeIndex, so assign by position, not by label
df1['converted'] = (A['Deal_Amount'] * A['to_usd_rate']).to_numpy()
output
index Currency Deal_Amount converted
0 0 USD 18.4 18.400000
1 1 USD 18.4 18.400000
2 2 USD 5559.0 5559.000000
3 3 USD 14300.0 14300.000000
4 4 USD 1000.0 1000.000000
5 5 EUR 3072.0 3603.916444
6 6 USD 500.0 500.000000
7 7 CAD 100000.0 80853.816300
8 8 USD 250.0 250.000000
9 15 EUR 6000.0 7038.899304
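An alternative sketch that sidesteps index alignment entirely: build a currency-to-rate lookup with set_index and map it onto the Currency column (assuming one rate per currency, i.e. a single year/quarter slice):
import pandas as pd

deals = pd.DataFrame({'Currency': ['USD', 'EUR', 'CAD'],
                      'Deal_Amount': [18.40, 3072.00, 100000.00]})
rates = pd.DataFrame({'currency_code': ['EUR', 'CAD'],
                      'to_usd_rate': [1.173149884, 0.808538163]})
# currencies with no row in rates (USD) fall back to a rate of 1
lookup = rates.set_index('currency_code')['to_usd_rate']
deals['converted'] = deals['Deal_Amount'] * deals['Currency'].map(lookup).fillna(1)
print(deals)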

Python - Pandas: how to divide by specific key's value

I would like to calculate a column using values from other rows of a pandas DataFrame.
For example, when I have this DataFrame,
df = pd.DataFrame({
"year" : ['2017', '2017', '2017', '2017', '2017','2017', '2017', '2017', '2017'],
"rooms" : ['1', '2', '3', '1', '2', '3', '1', '2', '3'],
"city" : ['tokyo', 'tokyo', 'toyko', 'nyc','nyc', 'nyc', 'paris', 'paris', 'paris'],
"rent" : [1000, 1500, 2000, 1200, 1600, 1900, 900, 1500, 2200],
})
print(df)
city rent rooms year
0 tokyo 1000 1 2017
1 tokyo 1500 2 2017
2 toyko 2000 3 2017
3 nyc 1200 1 2017
4 nyc 1600 2 2017
5 nyc 1900 3 2017
6 paris 900 1 2017
7 paris 1500 2 2017
8 paris 2200 3 2017
I'd like to add each city's rent compared to nyc's rent for the same year and rooms.
Ideal results are like below,
city rent rooms year vs_nyc
0 tokyo 1000 1 2017 0.833333
1 tokyo 1500 2 2017 0.9375
2 toyko 2000 3 2017 1.052631
3 nyc 1200 1 2017 1.0
4 nyc 1600 2 2017 1.0
5 nyc 1900 3 2017 1.0
6 paris 900 1 2017 0.75
7 paris 1500 2 2017 0.9375
8 paris 2200 3 2017 1.157894
How can I add a column like vs_nyc, taking account of the year and rooms?
I tried the following, but it did not work:
# filtering gives NaN values, and fillna(method='pad') also did not work
df.rent / df[df['city'] == 'nyc'].rent
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 NaN
7 NaN
8 NaN
Name: rent, dtype: float64
To illustrate:
set_index + unstack
d1 = df.set_index(['city', 'year', 'rooms']).rent.unstack('city')
d1
city nyc paris tokyo toyko
year rooms
2017 1 1200.0 900.0 1000.0 NaN
2 1600.0 1500.0 1500.0 NaN
3 1900.0 2200.0 NaN 2000.0
Then we can divide
d1.div(d1.nyc, 0)
city nyc paris tokyo toyko
year rooms
2017 1 1.0 0.750000 0.833333 NaN
2 1.0 0.937500 0.937500 NaN
3 1.0 1.157895 NaN 1.052632
solution
d1 = df.set_index(['city', 'year', 'rooms']).rent.unstack('city')
df.join(d1.div(d1.nyc, 0).stack().rename('vs_nyc'), on=['year', 'rooms', 'city'])
city rent rooms year vs_nyc
0 tokyo 1000 1 2017 0.833333
1 tokyo 1500 2 2017 0.937500
2 toyko 2000 3 2017 1.052632
3 nyc 1200 1 2017 1.000000
4 nyc 1600 2 2017 1.000000
5 nyc 1900 3 2017 1.000000
6 paris 900 1 2017 0.750000
7 paris 1500 2 2017 0.937500
8 paris 2200 3 2017 1.157895
A little cleaned up:
cols = ['city', 'year', 'rooms']
ny_rent = df.set_index(cols).rent.loc['nyc'].rename('ny_rent')
# join the nyc benchmark back on (year, rooms), then divide
df.assign(vs_nyc=df.rent / df.join(ny_rent, on=ny_rent.index.names).ny_rent)
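A merge-based alternative that some may find easier to read (a sketch, assuming the df defined in the question):
# pull out the nyc rows as the benchmark, then merge them back on year and rooms
nyc = (df.loc[df.city == 'nyc', ['year', 'rooms', 'rent']]
         .rename(columns={'rent': 'nyc_rent'}))
out = df.merge(nyc, on=['year', 'rooms'], how='left')
out['vs_nyc'] = out['rent'] / out['nyc_rent']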
