Pandas - Aggregating grouped data provided certain criteria is met - python

I want to create a time series of the valuation of a stock portfolio, by aggregating time series valuation data on the individual stock holdings of that portfolio. The problem I have is that on certain dates there may not be a valuation for a given stock holding and thus aggregating on that date would produce erroneous results.
The solution I have come up with is to exclude dates for which valuation (actually price) data doesn't exist for a given holding, and then aggregate only over the dates where I have complete data. The procedure I use is as follows:
# Get the individual holding valuation data
valuation = get_valuation(portfolio = portfolio, df = True)
# The next few lines retrieve the dates for which I have complete price data for the
# assets that comprise this portfolio
# First get a list of the assets that this portfolio contains (or has contained).
unique_assets = valuation['asset'].unique().tolist()
# Then I get the price data for these assets
ats = get_ats(assets = unique_assets, df = True )[['data_date','close_price']]
# I mark those dates for which I have a 'close_price' for each asset:
ats = ats.groupby('data_date')['close_price'].agg(data_complete=lambda x: len(x) == len(unique_assets)).reset_index()
# And extract the corresponding valid dates.
valid_dates = ats['data_date'][ats['data_complete']]
# Filter the valuation data for those dates for which I have complete data:
valuation = valuation[valuation['data_date'].isin(valid_dates)]
# Group by date, and sum the individual holding valuations by date, to get the portfolio valuation
portfolio_valuation = valuation.groupby('data_date')['valuation'].sum().reset_index()
My question is twofold:
1) The above approach feels quite convoluted, and I am confident that Pandas has a better way of implementing my solution. Any suggestions?
2) The approach I have used isn't ideal. A better method would be: for dates on which we have no valuation data for a given holding, use the most recent prior valuation for that holding. So let's say I am calculating the valuation of the portfolio on 21 June 2012 and have valuation data for GOOG on that date, but for AAPL only on 20 June 2012. Then the valuation of the portfolio on 21 June 2012 should still be the sum of those two valuations. Is there an efficient way to do this in Pandas? I want to avoid having to iterate through the data.

It seems like some combination of resample and/or fillna is going to get you what you're looking for (I realize this is coming a little late!).
Go grab your data just like you're doing; you get back something with a few gaps in it. Check out this:
import pandas as pd
import numpy as np
dates = pd.date_range(start='2012-01-01', periods=10, freq='2D')
df = pd.DataFrame(np.random.randn(20).reshape(10,2),index=dates)
So now you have this data that has lots of gaps in it -- but you want to have this daily resolution data.
Just do:
df.resample('1D').asfreq()
This will fill in your dataframe with a bunch of NaNs where you have missing data. And then when you do aggregations on it, just use functions (e.g., np.nansum, np.nanmean) that ignore NaNs!
I'm still a little unclear on the exact format of the data you've got, but hope this helps.
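For the second part of the question (using the most recent valuation when a date is missing), here is a minimal sketch of the forward-fill idea. It assumes valuation is the long-format frame from the question, with a datetime 'data_date' column and one row per (date, asset) pair:
# one column per asset, one row per date
wide = valuation.pivot(index='data_date', columns='asset', values='valuation')
# reindex to daily frequency and carry each holding's last known valuation forward
wide = wide.asfreq('D').ffill()
# the portfolio valuation is then the row-wise sum across holdings
# (NaNs before an asset's first valuation are simply skipped by sum)
portfolio_valuation = wide.sum(axis=1).rename('valuation').reset_index()
Whether calendar-day frequency ('D') or a business-day calendar is appropriate depends on how the price data is reported.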

Related

Cumulative Returns with Rebalancing dates using two Dataframes with different lengths

I wanted to build a loop to cumulate daily log returns with a yearly reset to 100 on a specific date in January. First problem: I am working with different dataframes.
df = ETF data and the main dataframe for the different parts of the calculation
Maturity_Date_Option_1 = dataframe with the different maturity dates
-> If a date in df matches a date in Maturity_Date_Option_1, the cumulative return should reset to 100 and then cumulate the daily returns onwards until the next match between the two dataframes. :)
I feel like I am near to the answer but missing something...
Hopefully you can help me with my problem. :)
for t in df.index:
    df['NDDUWI_cum_daily_returns'] = (df['NDDUWI_daily_return_log']
                                      + df['NDDUWI_cum_daily_returns'].shift(1))
    print(t)
    if t in Maturity_Date_Option_1.tolist():
        df['NDDUWI_cum_daily_returns'] = 100
    else:
        df['NDDUWI_cum_daily_returns']
Is there a more efficient way to structure this task? Sorry
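A vectorized sketch of one possible reading of the task (column and variable names follow the question's code; it assumes df is indexed by date, and treating each maturity date itself as exactly 100 is a modeling choice):
import pandas as pd
# mark the rows whose date is a maturity / reset date
reset = df.index.isin(pd.to_datetime(Maturity_Date_Option_1))
# label each stretch between resets with a running group id
period_id = reset.cumsum()
# cumulative log return within each stretch, offset so every stretch starts around 100
df['NDDUWI_cum_daily_returns'] = 100 + df.groupby(period_id)['NDDUWI_daily_return_log'].cumsum()
# force the reset dates themselves back to exactly 100
df.loc[reset, 'NDDUWI_cum_daily_returns'] = 100
This avoids the Python-level loop entirely; the groupby carries the reset boundaries for you.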

groupby.mean function dividing by pre-group count rather than post-group count

So I have the following dataset of trade flows that tracks imports and exports by reporting country and partner country. After I remove some unwanted columns, I edit my data frame so that trade flows between country A and country B are shown. I'm left with something like this:
[screenshot of the resulting data frame]
My issue is that I want to be able to take the average of imports and exports for every partner country ('partner_code') per year, but when I run the following:
x = df[(df.location_code.isin(['IRN'])) &
       (df.partner_code.isin(['TCD']))]
grouped = x.groupby(['partner_code']).mean()
I end up getting an average of all exports divided by all instances where there is a 'product_id' (so a much higher count in the denominator), rather than the average of total imports or exports across the years.
Taking the average of the following 5 export values gives an incorrect average: [screenshots of the 5 export values and the wrong average]
In pandas, we can group by multiple columns; based on my understanding, you want to group by partner, country and year.
The following line should work:
df = df.groupby(['partner_code', 'location_code', 'year'])[['import_value', 'export_value']].mean()
Please note that the resulting dataframe has a MultiIndex.
For reference, the official documentation: DataFrame.groupby documentation
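If a flat table is easier to work with afterwards, the MultiIndex can be pushed back into ordinary columns with reset_index; a small usage sketch, reusing the grouped result above:
grouped = df.groupby(['partner_code', 'location_code', 'year'])[['import_value', 'export_value']].mean()
flat = grouped.reset_index()  # partner_code, location_code and year become columns again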

Python: Shift time series so they all match at a given y value

I'm writing my own code to analyse/visualise COVID-19 data from the European CDC.
https://opendata.ecdc.europa.eu/covid19/casedistribution/csv
I've got a simple code to extract the data and make plots with cumulative deaths against time, and am trying to add functionality.
My aim is something like the attached graph, with all countries time-shifted to match at a given point (in this case the 5th death). I want to make a general bit of code to shift countries to match at the nth death.
https://ourworldindata.org/grapher/covid-confirmed-deaths-since-5th-death
The current way I'm trying to do this is to have a maze of "if group is 'country' shift by ..." terms,
where ... is a lookup to find the date on which the particular country reached n deaths, interpolating fractional dates where appropriate.
I.e. currently deaths are assigned at 00:00 on each day, but the data may need to be shifted by, say, 2/3 of a day, as below:
datetime cumulative deaths
00:00 15/02 80
00:00 16/02 110
my '...' should give 16:00 15/02
I'm working on this right now but it doesn't feel very efficient and I'm sure there must be a much simpler way that I'm not seeing.
Essentially, despite copious googling, I can't seem to find a simple way of automatically shifting a bunch of time series to match at a particular y value, which feels like it should have some built-in functionality, i.e. a lookup with interpolation.
####Live url (I've downloaded my own csv and been calling that for code development)
url = 'https://opendata.ecdc.europa.eu/covid19/casedistribution/csv'
dataraw = pd.read_csv(url)
# extract relevant columns
data = dataraw.loc[:,["dateRep","countriesAndTerritories","deaths"]]
####convert date format
data['dateRep'] = pd.to_datetime(data['dateRep'],dayfirst=True)
####sort by date
data = data.sort_values(["dateRep"],ascending=True)
data['cumdeaths'] = data.groupby('countriesAndTerritories')['deaths'].cumsum()
##### limit to countries with cumulative deaths > 500
data = data.groupby('countriesAndTerritories').filter(lambda x:x['cumdeaths'].max() >500)
###### remove China from data for now as it doesn't match so well with dates
data = data.groupby('countriesAndTerritories').filter(lambda x:(x['countriesAndTerritories'] != "China").any())
##### only recent dates
data = data[data['dateRep'] > '2020-03-01']
print(data)
You can use groupby('country') and the groupby transform function to add a column which, for every row, holds the date on which its country hit the nth death.
Then you can do a vectorized subtraction of the date column and the new column to get the number of days.
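A minimal sketch of that idea, reusing the data frame and column names from the question (the threshold nth_death and the new column names are assumptions):
nth_death = 5
# keep the date only where the cumulative death count has reached the threshold
reached = data['dateRep'].where(data['cumdeaths'] >= nth_death)
# transform('min') broadcasts each country's first such date back onto every row
data['ref_date'] = reached.groupby(data['countriesAndTerritories']).transform('min')
# vectorized subtraction: days since the nth death (negative before it)
data['days_since_nth_death'] = (data['dateRep'] - data['ref_date']).dt.days
Plotting cumdeaths against days_since_nth_death then lines the countries up at day 0.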

Intraday daily return

I am quite new to Python and I need your help, guys!
[screenshot of my data structure]
This is intraday 5-minute data from 2001/01/02 until 31/12/2019. As you can see from the data, column 0 indicates the date and column 2 indicates the stock price.
Each day, such as 2001/01/02, has 79 observations.
First of all, I need to create the daily return as a new column. Normally I deal with daily data, where the daily log return was computed as follows:
def lr(x):
    return np.log(x[1:]) - np.log(x[:-1])
How can I create a new column for the daily return from the 5-minute data?
If you load your data into a pandas.DataFrame, you can use df.groupby() and then apply your lr-function with minimal changes:
df = pd.read_excel('path/to/your/file.xlsx', header=None,
                   names=['Index', 'Date', 'Some_var', 'Stock_price'])
The key thing to do, though, will be to decide how you want to generate your daily values from your 5 minute data. I'm no stock expert, but I'd guess you want to use the last value for each day to represent the stock value. If that's the case, you can use
daily_values = df.groupby('Date')['Stock_price'].agg('last')
and then apply your lr function to get the returns
lr(daily_values)
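One small caveat worth adding here (an addition, not part of the original answer): if daily_values is a pandas Series, the two slices inside lr are realigned on their index before the subtraction, which is not what the function intends. A sketch of two safer variants, assuming daily_values and lr from above:
import numpy as np
# purely pandas: log return between consecutive days, the first day becomes NaN
daily_log_returns = np.log(daily_values).diff().dropna()
# or keep lr() but pass plain values so the slicing stays positional
daily_log_returns_arr = lr(daily_values.to_numpy())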

Lag values and differences in pandas dataframe with missing quarterly data

Though Pandas has time series functionality, I am still struggling with dataframes that have incomplete time series data.
See the pictures below: the lower picture has complete data, the upper has gaps. Both show correct values. In red are the columns that I want to calculate using the data in black. Column Cumm_Issd shows the accumulated issued shares during the year; MV is market value.
I want to calculate the issued shares per quarter (IssdQtr), the quarterly change in Market Value (D_MV_Q) and the MV of last year (L_MV_Y).
For the underlying CSV data, see this link for the full data and this link for the gapped data. There are two firms, 1020180 and 1020201.
However, when I try Pandas' shift method it fails when there are gaps; try it yourself using the CSV files and the code below. All columns (DiffEq, Dif1MV, Lag4MV) differ, for some quarters, from IssdQtr, D_MV_Q and L_MV_Y, respectively.
Are there ways to deal with gaps in data using Pandas?
import pandas as pd
import numpy as np
import os
dfg = pd.read_csv('example_soverflow_gaps.csv',low_memory=False)
dfg['date'] = pd.to_datetime(dfg['Period'], format='%Y%m%d')
dfg['Q'] = pd.DatetimeIndex(dfg['date']).to_period('Q')
dfg['year'] = dfg['date'].dt.year
dfg['DiffEq'] = dfg.sort_values(['Q']).groupby(['Firm','year'])['Cumm_Issd'].diff()
dfg['Dif1MV'] = dfg.groupby(['Firm'])['MV'].diff(1)
dfg['Lag4MV'] = dfg.groupby(['Firm'])['MV'].shift(4)
Gapped data:
Full data:
Solved the basic problem by using a merge. First, create a variable that shows the lagged date or quarter. Here we want last year's MV (4 quarters back):
from pandas.tseries.offsets import QuarterEnd
dfg['lagQ'] = dfg['date'] + QuarterEnd(-4)
Then create a data-frame with the keys (Firm and date) and the relevant variable (here MV).
lagset=dfg[['Firm','date', 'MV']].copy()
lagset.rename(columns={'MV':'Lag_MV', 'date':'lagQ'}, inplace=True)
Lastly, merge the new frame into the existing one:
dfg=pd.merge(dfg, lagset, on=['Firm', 'lagQ'], how='left')
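The same merge pattern should also give the one-quarter lag behind D_MV_Q; a short sketch reusing dfg (the column names lagQ1 and Lag1_MV are illustrative):
from pandas.tseries.offsets import QuarterEnd
import pandas as pd
# key pointing one quarter back
dfg['lagQ1'] = dfg['date'] + QuarterEnd(-1)
lag1 = dfg[['Firm', 'date', 'MV']].rename(columns={'date': 'lagQ1', 'MV': 'Lag1_MV'})
dfg = pd.merge(dfg, lag1, on=['Firm', 'lagQ1'], how='left')
# quarterly change in market value, NaN where the previous quarter is missing
dfg['D_MV_Q'] = dfg['MV'] - dfg['Lag1_MV']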
