Here is part of the data in scaffold_table:
import pandas as pd
scaffold_table = pd.DataFrame({
'Position':[2000]*5,
'Company':['Amazon', 'Amazon', 'Alphabet', 'Amazon', 'Alphabet'],
'Date':['2020-05-26','2020-05-27','2020-05-27','2020-05-28','2020-05-28'],
'Ticker':['AMZN','AMZN','GOOG','AMZN','GOOG'],
'Open':[2458.,2404.9899,1417.25,2384.330078,1396.859985],
'Volume':[3568200,5056900,1685800,3190200,1692200],
'Daily Return':[-0.006164,-0.004736,0.000579,-0.003854,-0.000783],
'Daily PnL':[-12.327054,-9.472236,1.157283,-7.708126,-1.565741],
'Cumulative PnL/Ticker':[-12.327054,-21.799290,1.157283,-29.507417,-0.408459]})
I would like to create a summary table that returns the overall yield per ticker. The overall yield should be calculated as the total PnL per ticker divided by the last date's position per ticker.
# Create a summary table of your average daily PnL, total PnL, and overall yield per ticker
summary_table = pd.DataFrame(scaffold_table.groupby(['Date','Ticker'])['Daily PnL'].mean())
position_ticker = pd.DataFrame(scaffold_table.groupby(['Date','Ticker'])['Position'].sum())
# the total PnL is the sum of the PnL per ticker over the two-year period
totals = summary_table.droplevel('Date').groupby('Ticker').sum().rename(columns={'Daily PnL':'total PnL'})
summary_table = summary_table.join(totals, on='Ticker')
summary_table = summary_table.join(position_ticker, on = ['Date','Ticker'], how='inner')
summary_table['Yield'] = summary_table.loc['2022-04-29']['total PnL']/summary_table.loc['2022-04-29']['Position']
summary_table
But the yield is showing NaN; could anyone take a look at my code?
I used ['2022-04-29'] because it is the last date, but I think there is a way to return the last date without explicitly typing it in.
I solved the problem with the following code
# we want the overall yield per ticker, so total PnL/Position on the last date
summary_table['Yield'] = summary_table['total PnL']/summary_table.loc['2022-04-29']['Position']
This does not specify a date for total PnL, since it is the sum per ticker without regard to the date.
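To avoid hard-coding the date, the last date can be taken from the data itself. A minimal sketch along the same lines as the line above (last_date is my own name):
# the 'YYYY-MM-DD' strings sort chronologically, so max() gives the latest date
last_date = scaffold_table['Date'].max()
summary_table['Yield'] = summary_table['total PnL']/summary_table.loc[last_date]['Position']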
I note the comment in your code saying: "Create a summary table of your average daily PnL, total PnL, and overall yield per ticker".
If we start from this, here are a few observations:
- The average daily PnL per ticker is just the mean of Daily PnL for each ticker.
- The total PnL per ticker is already listed in the Cumulative PnL/Ticker column: if we group by Ticker and take the value of Cumulative PnL/Ticker on the most recent date (namely, the last() row of each group, assuming the df is sorted by date; a guard for this assumption is sketched after the sample output below), we don't have to calculate it.
- For the overall yield per ticker (which you have specified "should be calculated as the total PnL per ticker divided by the last date's position per ticker"), we can get the relevant Position (namely, the one for the most recent date per ticker) analogously to how we got the relevant Cumulative PnL/Ticker, and divide these two values to get Yield.
Here is sample code to do this:
import pandas as pd
scaffold_table = pd.DataFrame({
'Position':[2000]*5,
'Company':['Amazon', 'Amazon', 'Alphabet', 'Amazon', 'Alphabet'],
'Date':['2020-05-26','2020-05-27','2020-05-27','2020-05-28','2020-05-28'],
'Ticker':['AMZN','AMZN','GOOG','AMZN','GOOG'],
'Open':[2458.,2404.9899,1417.25,2384.330078,1396.859985],
'Volume':[3568200,5056900,1685800,3190200,1692200],
'Daily Return':[-0.006164,-0.004736,0.000579,-0.003854,-0.000783],
'Daily PnL':[-12.327054,-9.472236,1.157283,-7.708126,-1.565741],
'Cumulative PnL/Ticker':[-12.327054,-21.799290,1.157283,-29.507417,-0.408459]})
print(scaffold_table)
# Create a summary table of your average daily PnL, total PnL, and overall yield per ticker
gb = scaffold_table.groupby(['Ticker'])
summary_table = gb.last()[['Position', 'Cumulative PnL/Ticker']].rename(columns={'Cumulative PnL/Ticker':'Total PnL'})
summary_table['Yield'] = summary_table['Total PnL'] / summary_table['Position']
summary_table['Average Daily PnL'] = gb['Daily PnL'].mean()
summary_table = summary_table[['Average Daily PnL', 'Total PnL', 'Yield']]
print('\nsummary_table:'); print(summary_table)
exit()
Input:
Position Company Date Ticker Open Volume Daily Return Daily PnL Cumulative PnL/Ticker
0 2000 Amazon 2020-05-26 AMZN 2458.000000 3568200 -0.006164 -12.327054 -12.327054
1 2000 Amazon 2020-05-27 AMZN 2404.989900 5056900 -0.004736 -9.472236 -21.799290
2 2000 Alphabet 2020-05-27 GOOG 1417.250000 1685800 0.000579 1.157283 1.157283
3 2000 Amazon 2020-05-28 AMZN 2384.330078 3190200 -0.003854 -7.708126 -29.507417
4 2000 Alphabet 2020-05-28 GOOG 1396.859985 1692200 -0.000783 -1.565741 -0.408459
Output:
Average Daily PnL Total PnL Yield
Ticker
AMZN -9.835805 -29.507417 -0.014754
GOOG -0.204229 -0.408459 -0.000204
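One caveat on the code above: last() simply takes the final row of each group in row order, so the "sorted by date" assumption matters. A small guard, if the frame might not be sorted:
# sort first so that the last() row per ticker really is the most recent date
scaffold_table = scaffold_table.sort_values('Date')
gb = scaffold_table.groupby(['Ticker'])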
I have a dataset with millions of records, like the one below:
CustomerID  StartTime   EndTime
1111        2015-7-10   2016-3-7
1112        2016-1-5    2016-1-19
1113        2015-10-18  2020-9-1
This dataset contains the information for different subscription contracts, and it is assumed that:
- If the contract is active, the customer pays a monthly fee in advance. The first payment is collected on the start date.
- If the contract ends before the next payment date, which is exactly one month after the last payment date, the customer does not need to pay the next subscription. For instance, customer 1112 only needs to pay once.
- The monthly payment fee is $10.
In this situation, I need to calculate the monthly/quarterly/annual sales between 2015 and 2020. It is ideal to also show the breakdown of sales by different customer IDs so that subsequent machine learning tasks can be performed.
Importing data (I saved your table as a .csv in Excel, which is the reason for the specific formatting of the pd.to_datetime):
import pandas as pd
import numpy as np
df = pd.read_csv("Data.csv", header=0)
# convert columns "to_datetime"
df["StartTime"] = pd.to_datetime(df["StartTime"], format="%d/%m/%Y")
df["EndTime"] = pd.to_datetime(df["EndTime"], format="%d/%m/%Y")
Calculate the number of months between the start and end dates (+1 at the end because there will be a payment even if the contract is not active for a whole month, because it is in advance):
df["Months"] = ((df["EndTime"] - df["StartTime"])/np.timedelta64(1, 'M')).astype(int) + 1
Generate a list of payment dates (from the start date, for the given number of months, one month apart). The pd.tseries.offsets.DateOffset(months=1) will ensure that the payment date is on the same day every month, rather than the default end-of-month if freq="M".
df["PaymentDates"] = df.apply(lambda x: list(pd.date_range(start=x["StartTime"], periods=x["Months"], freq=pd.tseries.offsets.DateOffset(months=1)).date), axis=1)
Create a new row for each payment date, add a payment column of 10, then pivot so that the CustomerID is the column, and the date is the row:
df = df.explode("PaymentDates").reset_index(drop=True)
df["PaymentDates"] = pd.to_datetime(df["PaymentDates"])
df["Payment"] = 10
df = pd.pivot_table(df, index="PaymentDates", columns="CustomerID", values="Payment")
Aggregate for month, quarter, and year sales (this will be an aggregation for each individual CustomerID; you can then sum across each row to get a total amount):
months = df.groupby([df.index.year, df.index.month]).sum()
quarters = df.groupby([df.index.year, df.index.quarter]).sum()
years = df.groupby(df.index.year).sum()
# total sales
months["TotalSales"] = months.sum(axis=1)
quarters["TotalSales"] = quarters.sum(axis=1)
years["TotalSales"] = years.sum(axis=1)
I realise the df.apply may be slow if you have millions of records, and there may be other ways to do this, but it is what I have thought of.
You will also have a lot of columns if there are many millions of customers, but this way you keep all the CustomerID values separate and can tell which customers made payments in a given month.
After the number of months is calculated in df["Months"], you could also just multiply it by 10 to get the total sales for each customer.
If that is the only per-customer figure you need, you would not need to pivot the data at all: just aggregate on the "PaymentDates" column, count the number of rows, and multiply by 10 to get the sales per month, quarter, or year, as sketched below.
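A minimal sketch of that simpler route, assuming we branch off right after the "Months" and "PaymentDates" columns are built (i.e., before the explode/pivot steps); the variable names here are mine:
# per-customer lifetime sales: number of payments times the $10 fee
customer_sales = df["Months"] * 10
# period sales: after explode() there is one row per payment, so count rows and multiply by 10
paid = df[["CustomerID", "PaymentDates"]].explode("PaymentDates")
paid["PaymentDates"] = pd.to_datetime(paid["PaymentDates"])
monthly_sales = paid.groupby(paid["PaymentDates"].dt.to_period("M")).size() * 10
quarterly_sales = paid.groupby(paid["PaymentDates"].dt.to_period("Q")).size() * 10
annual_sales = paid.groupby(paid["PaymentDates"].dt.to_period("Y")).size() * 10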
I am conducting some market research. One of the variables I am investigating is the time it takes for an event to occur; it should follow a log distribution, and I want to build a cumulative distribution function of it as a function of time. (I simply convert my dates like so:
A = datetime.strptime(UDate1[0], date_format)
B = datetime.strptime(UDate2[0], date_format)
and I can subtract like so:
C = (A - B).days
which returns an integer number of days: 5, 6, 10, 11, whatever it may be.)
My data should fit a log distribution. However, because I am currently using calendar days while my events only occur on market days, this is an unacceptable source of error: it leaves empty bins in my histogram (days 6 and 7 of every week are always zero because of weekends, plus holiday effects).
I cannot calculate an accurate cumulative distribution function this way, so I recently downloaded pandas-market-calendars. Does anyone have experience calculating trading days as opposed to calendar days? For example, from July 1, 2020 to July 13, 2020 is 12 calendar days, but only 8 trading days.
Info on the Pandas Market Calendars is here:
https://pypi.org/project/pandas-market-calendars/
First, create a market data object as described in the link:
import pandas_market_calendars as mcal
# Create a calendar
nyse = mcal.get_calendar('NYSE')
early = nyse.schedule(start_date='2012-07-01', end_date='2012-07-10')
print(mcal.date_range(early, frequency='1D'))
DatetimeIndex(['2012-07-02 20:00:00+00:00', '2012-07-03 17:00:00+00:00',
'2012-07-05 20:00:00+00:00', '2012-07-06 20:00:00+00:00',
'2012-07-09 20:00:00+00:00', '2012-07-10 20:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
Now create a series of ones indexed by market days. Then reindex on calendar days and fill missing values with zeros. Compute the cumulative sum; the number of trading days between two dates is then the difference between the cumulative sums at those two dates:
import pandas as pd
bus_day_index = pd.DatetimeIndex(
['2012-07-02 20:00:00+00:00', '2012-07-03 17:00:00+00:00',
'2012-07-05 20:00:00+00:00', '2012-07-06 20:00:00+00:00',
'2012-07-09 20:00:00+00:00', '2012-07-10 20:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
bus_day_index = bus_day_index.normalize()
s = pd.Series(data=1, index=bus_day_index)
cal_day_index = pd.date_range(start=bus_day_index.min(), end=bus_day_index.max())
s = s.reindex(index=cal_day_index).fillna(0).astype(int)
s = s.cumsum()
s['2012-07-09'] - s['2012-07-03']
Advantage: This (inelegant) method incorporates non-trading days that fall on weekdays (Memorial Day, Labor Day, etc. in the U.S.).
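If all you need is the count of trading days between two dates, the schedule itself may already suffice, since it has one row per trading day. A small sketch with the same calendar object:
# trading days from 2012-07-03 through 2012-07-09, inclusive
print(len(nyse.schedule(start_date='2012-07-03', end_date='2012-07-09')))  # 4: July 4th and the weekend are skipped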
From your question it sounds like you want to count the number of trading days. If so, please try the following:
from datetime import timedelta
start_date = A
end_date = B
delta = timedelta(days=1)
count = 0
while start_date <= end_date:
    print(start_date.strftime("%Y-%m-%d"))
    if start_date.weekday() < 5:  # Monday=0 ... Friday=4, so weekends are skipped
        count += 1
    start_date += delta
print(count)
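Note that this loop ignores market holidays. As an alternative sketch (not part of the answer above), numpy's busday_count does the weekday counting without a loop and accepts an explicit holidays list; the end date is exclusive:
import numpy as np
# Mon-Fri by default; pass market holidays explicitly
np.busday_count('2020-07-01', '2020-07-14', holidays=['2020-07-03'])  # -> 8, matching the July 1 to July 13, 2020 example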
Goal:
Calculate a 50-day moving average for each day, based on the past 50 days. I can calculate the mean for the entire dataset, but I am trying to continuously calculate the mean based on the past 50 days, with it changing each day of course!
import numpy as np
import pandas_datareader.data as pdr
import pandas as pd
# Define the instruments to download (here, just Apple).
ticker = ['AAPL']
#Define the data period that you would like
start_date = '2017-07-01'
end_date = '2019-02-08'
# Use pandas_datareader.data.DataReader to load the stock prices from Yahoo Finance.
df = pdr.DataReader(ticker, 'yahoo', start_date, end_date)
# Yahoo Finance gives 'High', 'Low', 'Open', 'Close', 'Volume', 'Adj Close'.
#Extract Close Price, Volume, and Date from yahoo finance
CloseP = df['Close']
CloseP.head()
Volm = df['Volume']
Volm.head()
Date = df["Date"] = df.index
#create a table with Date, Close Price, and Volume
Table = pd.DataFrame(np.array(Date), columns = ['Date'])
Table['Close Price'] = np.array(CloseP)
Table['Volume'] = np.array(Volm)
print (Table)
#create a column that continuously calculates the 50 day MA
#This is what I can't get to work!
MA = np.mean(df['Close'])
Table['Moving Average'] = np.array(MA)
print (Table)
First of all, please don't use CamelCase to name your variables, as it makes them look like class names.
Next, use merge() to join your data frames instead of that np.array approach:
>>> table = CloseP.merge(Volm, left_index=True, right_index=True)
>>> table.columns = ['close', 'volume'] # give names to columns
>>> table.head(10)
close volume
Date
2017-07-03 143.500000 14277800.0
2017-07-05 144.089996 21569600.0
2017-07-06 142.729996 24128800.0
2017-07-07 144.179993 19201700.0
2017-07-10 145.059998 21090600.0
2017-07-11 145.529999 19781800.0
2017-07-12 145.740005 24884500.0
2017-07-13 147.770004 25199400.0
2017-07-14 149.039993 20132100.0
2017-07-17 149.559998 23793500.0
Finally, use a combination of rolling(), mean() and dropna() to calculate the moving average:
>>> ma50 = table.rolling(window=50).mean().dropna()
>>> ma50.head(10)
close volume
Date
2017-09-12 155.075401 26092540.0
2017-09-13 155.398401 26705132.0
2017-09-14 155.682201 26748954.0
2017-09-15 156.025201 27248670.0
2017-09-18 156.315001 27430024.0
2017-09-19 156.588401 27424424.0
2017-09-20 156.799201 28087816.0
2017-09-21 156.952201 28340360.0
2017-09-22 157.034601 28769280.0
2017-09-25 157.064801 29254384.0
Please refer to the docs of the mentioned API calls for more info on their usage. Good luck!
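If you would rather keep the first 49 rows, averaged over whatever data is available so far, instead of dropping them, rolling() also takes a min_periods argument. A small variant, as a sketch:
>>> ma50_partial = table.rolling(window=50, min_periods=1).mean()  # partial windows at the start instead of NaN + dropna()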
I have a Pandas dataframe of size (80219 * 5) with the same structure as the image I have uploaded. The data can range from 2002-2016 for each company, but if values are missing, the data either starts at a later date or ends at an earlier date, as you can see in the image.
What I would like to do is calculate yearly compounded returns, measured from June to June, for each company. If there is no data for a specific company for the full 12-month period from June to June, the result should be NaN. Below is my current code, but I don't know how to calculate the returns from June to June.
After loading and cleaning the file, I do:
df[['Returns']] = df[['Returns']].apply(pd.to_numeric)
df['Names Date'] = pd.to_datetime(df['Names Date'])
df['Returns'] = df['Returns']+ 1
df = df[['Company Name','Returns','Names Date']]
df['year']=df['Names Date'].dt.year
df['cum_return'] = df.groupby(['Company Name','year']).cumprod()
df = df.groupby(['Company Name','year']).nth(11)
print(tabulate(df, headers='firstrow', tablefmt='psql'))
This calculates the annual return from the 1st of January to the 31st of December, though.
I finally found a way to do it. The easiest way I could find is to calculate a rolling 12-month compounded return for each month and then slice the dataframe to give me the 12-month returns of the months I want:
def myfunc(arr):
    return np.cumprod(arr)[-1]

# rolling 12-month compounded return per company
# (pd.rolling_apply no longer exists; use Series.rolling instead)
cum_rets = []
grouped = df.groupby('Company Name')
for name, group in grouped:
    cum_rets.append(group['Returns'].rolling(12).apply(myfunc, raw=True))
df['Cum returns'] = pd.concat(cum_rets)
# keep only the June rows, then label them by year
df = df.loc[df['Names Date'].dt.month == 6]
df['Names Date'] = df['Names Date'].dt.year
I have a list of 6 stocks. I have set up my code to reference the stock name from the list rather than hard-coding the stock name, starting with SPY, which is in position 0. The code below the list will return yesterday's closing price of the stock.
My question is: how do I loop the code through each stock in the list so that I print out the closing price for all 6 stocks?
I think I need to use loops but I don't understand them.
Any ideas?
CODE:
#import packages
import pandas_datareader.data as web
import datetime as dt
#create list of stocks to reference later
stocks = ['SPY', 'QQQ', 'IWM', 'AAPL', 'FB', 'GDX']
#define prior day close price
start = dt.datetime(2010, 1, 1)
end = dt.datetime(2030, 1, 27)
ticker = web.DataReader(stocks[0], 'google', start, end)
prior_day = ticker.iloc[-1]
PDL = list(prior_day)
prior_close = PDL[3]
#print the name of the stock from the stocks list, and the prior close price
print(stocks[0])
print('Prior Close')
print(prior_close)
RETURNS:
SPY
Prior Close
249.08
You could use a loop, but you don't need loops for this. Pass your entire list of stocks to the DataReader. This should be cheaper than making multiple calls.
stocks = ['SPY', 'QQQ', 'IWM', 'AAPL', 'FB', 'GDX']
ticker = web.DataReader(stocks, 'google', start, end)
close = ticker.to_frame().tail(len(stocks))['Close'].to_frame('Prior Close')  # tail() defaults to 5 rows, which would drop one of the 6 tickers
print(close)
Prior Close
Date minor
2017-09-26 FB 164.21
GDX 23.35
IWM 144.61
QQQ 143.17
SPY 249.08
Details
ticker is a panel, but can be converted to a dataframe using to_frame:
print(ticker)
<class 'pandas.core.panel.Panel'>
Dimensions: 5 (items) x 251 (major_axis) x 6 (minor_axis)
Items axis: Open to Volume
Major_axis axis: 2016-09-28 00:00:00 to 2017-09-26 00:00:00
Minor_axis axis: AAPL to SPY
df = ticker.to_frame()
You can view all recorded dates of stocks using df.index.get_level_values:
print(df.index.get_level_values('Date'))
DatetimeIndex(['2016-09-28', '2016-09-28', '2016-09-28', '2016-09-28',
'2016-09-28', '2016-09-28', '2016-09-29', '2016-09-29',
'2016-09-29', '2016-09-29',
...
'2017-09-25', '2017-09-25', '2017-09-25', '2017-09-25',
'2017-09-26', '2017-09-26', '2017-09-26', '2017-09-26',
'2017-09-26', '2017-09-26'],
dtype='datetime64[ns]', name='Date', length=1503, freq=None)
If you want to view all stocks for a particular date, you can use df.loc with a slice. For your case, you want to see the closing prices on the last date, so you can use df.tail (pass len(stocks), since tail() defaults to just 5 rows and would drop one of the six tickers):
print(df.tail(len(stocks))['Close'].to_frame())
Close
Date minor
2017-09-26 FB 164.21
GDX 23.35
IWM 144.61
QQQ 143.17
SPY 249.08
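Side note, for anyone running this today: Panel was removed in pandas 0.25 and the 'google' source has been shut down, so recent pandas-datareader returns a DataFrame with MultiIndex columns instead. A sketch of the same idea using 'stooq' as a stand-in source (its continued availability is an assumption on my part):
import datetime as dt
import pandas_datareader.data as web
stocks = ['SPY', 'QQQ', 'IWM', 'AAPL', 'GDX']
start, end = dt.datetime(2023, 1, 1), dt.datetime(2023, 6, 30)
# columns come back as (Attribute, Symbol); stooq serves newest rows first, so sort before taking the last row
df = web.DataReader(stocks, 'stooq', start, end).sort_index()
print(df['Close'].iloc[-1].to_frame('Prior Close'))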
You can just use a for loop:
start = dt.datetime(2010, 1, 1)
end = dt.datetime(2030, 1, 27)
for stock in stocks:
    ticker = web.DataReader(stock, 'google', start, end)
    prior_day = ticker.iloc[-1]
    PDL = list(prior_day)
    prior_close = PDL[3]
    print(stock)
    print('Prior Close')
    print(prior_close)
I'll make you a function to which you can always pass a list of stocks, and which gives you back a time series. ;)
I use this function for numerous tickers
import datetime as dt
import pandas as pd
import pandas_datareader.data as web

tickers = ['SPY', 'QQQ', 'EEM', 'INDA', 'AAPL', 'MSFT']  # add as many tickers as you like
start = dt.datetime(2010, 3, 31)
end = dt.datetime.today()

# Function starts here
def get_previous_close(strt, end, tick_list, this_price):
    """arg: `this_price` can take str Open, High, Low, Close or Volume"""
    # make an empty dataframe to which we will append one column per ticker
    adj_close = pd.DataFrame([])
    # loop over the tickers
    for i in tick_list:
        total = web.DataReader(i, 'google', strt, end)
        adj_close[i] = total[this_price]
    return adj_close

# call the function
get_previous_close(start, end, tickers, 'Close')
You can use this time series in any way you like. It's always good to use a function for maintainability and re-usability. Also, this function can take 'yahoo' instead of 'google'.