Having a strange issue that skirts the line between stat/math and programming (I'm unsure what the root issue is).
I have a dataframe of returns by asset by month. Here's a small snapshot of the dataframe:
Strategy Theme Idea month PnL
Event Catalyst European Oil Services 2019-05 -1.412264e-10
Event Catalyst European Oil Services 2019-06 -2.688968e-08
Event Catalyst None 2019-06 1.546945e-08
Event M&A None 2019-06 2.128868e-08
Fundamental 5G Rollout Intelsat 2019-01 1.375019e-02
Now, when I group by month and sum, then compound the result, I get the answer I'd expect, and what I am able to replicate in excel:
x = df.groupby(['month']).sum()
total_pnl = x.compound()['PnL']
This gets me my expected result of 4.16%. So far so good. However, taking the same dataframe, compounding the individual Ideas, and then summing yields a different answer: 5.89%
dfx = df.pivot_table(index='month',columns=['Strategy','Theme','Idea'],values='PnL')
dfx = dfx.fillna(0.0)
dfx = dfx.compound()
dfx = dfx.reset_index(drop=False)
dfx.columns= ['Strategy','Theme','Idea','PnL']
total_pnl = sum(dfx['PnL'])
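(A quick aside for anyone running this on a recent pandas: .compound() has since been removed. A minimal sketch of both pipelines done by hand, assuming compound() here means the usual (1 + r).prod() - 1:)
# Method 1: sum across ideas within each month, then compound the monthly totals
monthly = df.groupby('month')['PnL'].sum()
sum_then_compound = (1 + monthly).prod() - 1

# Method 2: compound each idea over time, then sum across ideas
per_idea = df.pivot_table(index='month', columns=['Strategy', 'Theme', 'Idea'], values='PnL').fillna(0.0)
compound_then_sum = ((1 + per_idea).prod() - 1).sum()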
At first, I thought it was simply that it is not mathematically acceptable to compound individual returns then sum them, but a simple example I did in excel proved to me that the two methods should be the same. I then checked in excel if having 0 returns on any given month would be a problem -- it isn't.
Now I've been scratching my head for a couple hours trying to figure out why these numbers don't match when I do it in Python, while my simple excel example shows me they should.
Can you think of any caveats that I'm not taking into account that may be causing this?
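For what it's worth, a tiny self-contained check (made-up numbers, not from the post) shows the two aggregation orders are not interchangeable in general, because the cross-terms between assets only appear when you sum first:
import pandas as pd

r = pd.DataFrame({'A': [0.10, 0.10], 'B': [0.05, 0.05]})   # two assets, two months

sum_then_compound = (1 + r.sum(axis=1)).prod() - 1   # 1.15 * 1.15 - 1 = 0.3225
compound_then_sum = ((1 + r).prod() - 1).sum()       # 0.21 + 0.1025  = 0.3125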
I have two large dataframes where df1 has more rows than df2 because df1 operates at a finer time resolution of the logistics in question. I want to match two value columns of df2 into df1, so I created a time reference column using dt.floor() such that a surjective mapping df1.time_ref == df2.time can be applied.
Imagine something like this:
df1:                          df2:
   time   time_ref               time   sale   nbr
0  10.10  01.10                  01.10  27344  4
1  17.10  01.10                  01.11  31160  5
2  24.10  01.10                  01.12  19482  3
3  31.10  01.10
4  07.11  01.11
5  14.11  01.11
6  21.11  01.11
7  28.11  01.11
8  05.12  01.12
The goal is to attach the fraction sale/nbr of each month to every week of that month for reference.
It should therefore end up like this:
df1:
time time_ref monthlyObjAvg
0 10.10 01.10 6836
1 17.10 01.10 6836
2 24.10 01.10 6836
3 31.10 01.10 6836
4 07.11 01.11 6232
5 14.11 01.11 6232
6 21.11 01.11 6232
7 28.11 01.11 6232
8 05.12 01.12 6494
Though I have not thought it through, in SQL this would probably be really easy. Using some near-pseudo SQL, the operation would likely be something of this nature:
SELECT df1.*, df2.sale / df2.nbr AS "monthlyObjAvg"
FROM df1
JOIN df2 ON df1.time_ref = df2.time
In Pandas I had a much harder time solving, and even researching, this problem, since all search engine results led only to .map() functions or to conditional column selection problems. Note that classical conditional selection of the form df1[df1["time_ref"] == df2["time"]]["sale"] cannot be applied, because comparisons between two differently sized dataframes are not allowed in Pandas. My instinct was also that Pandas might have some detection feature that noticed the existence of an unambiguous surjective mapping and then rationalised such an expression, but that turned out to be false.
Note that I had already solved this problem using loops before this. Looks like this:
advIdx = 0
for n in range(df1.shape[0]):
    for m in range(advIdx, df2.shape[0]):
        if df1['time_ref'][n] == df2['time'][m]:
            df1.loc[n, 'monthlyObjAvg'] = df2.loc[m, 'sale'] / df2.loc[m, 'nbr']
            advIdx = m
            break
Employing a forward-moving index (since old times are never relevant again), one can even reduce the complexity from n*m to roughly n+m. Yet even with such a dramatic improvement, applying the loop solution to datasets of 10,000 to 1,000,000+ rows still takes anywhere from a couple of seconds to minutes, so it still cries out for a proper vectorized solution.
While it took a while, I eventually figured out which Pandas function manages to simulate the conditional argument. Although it can't do it very straightforwardly, pandas.merge pretty much implements the SQL statement above. You have to rename the column that the "join"/merge will be applied to, which is df2.time here:
df2.rename(columns = {'time':'time_ref'}, inplace=True)
Then apply the leftside join:
df1 = pd.merge(df1, df2, how='left', on=['time_ref'])
And finally create the target column and drop the rest:
df1['monthlyObjAvg'] = df1['sale']/df1['nbr']
df1.drop(["time_ref", "sale", "nbr"], axis=1, inplace=True)
This properly produces the lightning-fast solution I was searching for (it runs in the millisecond range on a 50,000+ row sample), but it still seems somewhat inelegant. I wanted to leave this here as a buoy for future me-like people who wonder about this, search with these terms, and find nothing much, since it is a proper solution after all.
If anyone can offer a more elegant and intuitive way to do this (e.g. directly projecting the fraction into the places that fulfil the condition, instead of copying everything, then calculating, then deleting the copies), feel welcome to do so.
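One possibly more elegant variant (a hedged sketch, not benchmarked against the merge above): build the per-month average as a Series indexed by month and let .map() project it directly onto df1, which avoids the copy-then-drop step:
# Per-month average, indexed by the month start date
monthly_avg = df2.set_index('time')['sale'] / df2.set_index('time')['nbr']

# Project it onto df1 via the time_ref -> time mapping
df1['monthlyObjAvg'] = df1['time_ref'].map(monthly_avg)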
Objective:
I need to show the trend in ageing of issues. e.g. for each date in 2021 show the average age of the issues that were open as at that date.
Starting data (Historic issue list): "df"

ref   Created    resolved
a1    20/4/2021  15/5/2021
a2    21/4/2021  15/6/2021
a3    23/4/2021  15/7/2021
Endpoint: "df2"

Date      Avg_age
1/1/2021  x
2/1/2021  y
3/1/2021  z
where x,y,z are the calculated averages of age for all issues open on the Date.
Tried so far:
I got this to work in what feels like a very poor way.
Create a date range: pd.date_range(start, finish, freq="D").
Loop through the dates in this range and, for each date, filter the "df" dataframe (boolean filtering) down to only the issues live on that date, then calculate the age (date - Created) and average it. Each result is appended to a list.
Once done, convert the list into a series for the final result, which can then be graphed or whatever.
hist_dates = pd.date_range(start="2021-01-01", end="2021-12-31", freq="D")
results_list = []

for each_date in hist_dates:
    f1 = df.Created < each_date       # filter 1: already created
    f2 = df.resolved >= each_date     # filter 2: not yet resolved
    df['Refdate'] = each_date         # make column to allow Refdate - Created
    df['Age'] = df.Refdate - df.Created
    results_list.append(df[f1 & f2].Age.mean())
Problems:
This works, but it feels sloppy and it doesn't seem fast. The current data-set is small, but I suspect this wouldn't scale well. I'm trying not to solve everything with loops as I understand it is a common mistake for beginners like me.
I'll give you two solutions: the first one is step-by-step, for you to understand the idea and process; the second one replicates the functionality in a much more condensed way, skipping some intermediate steps.
First, create a new column that holds your issue age, i.e. df['age'] = df.resolved - df.Created (I'm assuming your columns are of datetime type, if not, use pd.to_datetime to convert them)
You can then use groupby to group your data by creation date. This will internally slice your dataframe into several pieces, one for each distinct value of Created, grouping all values with the same creation date together. This way, you can then use aggregation on a creation date level to get the average issue age like so
# [['Created', 'age']] selects only the columns you are interested in
df[['Created', 'age']].groupby('Created').mean()
With an additional fourth data point [a4, 2021/4/20, 2021/4/30] (to enable some proper aggregation), this would end up giving you the following Series with the average issue age by creation date:
age
Created
2021-04-20 17 days 12:00:00
2021-04-21 55 days 00:00:00
2021-04-23 83 days 00:00:00
A more condensed way of doing this is to define a custom function and apply it to each creation date grouping:
def issue_age(group: pd.DataFrame):
    return (group['resolved'] - group['Created']).mean()

df.groupby('Created').apply(issue_age)
This call will give you the same Series as before.
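As a side note (not part of the answer above): if the goal is specifically the average age of the issues that were open on each calendar date, one way to avoid the per-date loop entirely is to broadcast the comparison over all reference dates at once. A hedged sketch, assuming df has the Created/resolved columns shown in the question:
import numpy as np
import pandas as pd

created = pd.to_datetime(df['Created'], dayfirst=True).to_numpy()
resolved = pd.to_datetime(df['resolved'], dayfirst=True).to_numpy()
ref_dates = pd.date_range('2021-01-01', '2021-12-31', freq='D').to_numpy()

# (n_dates, n_issues) mask: issue was created before and resolved on/after the reference date
open_mask = (created[None, :] < ref_dates[:, None]) & (resolved[None, :] >= ref_dates[:, None])

# Age in days of every issue on every reference date
age_days = (ref_dates[:, None] - created[None, :]) / np.timedelta64(1, 'D')

# Mean age of the open issues per date (NaN where nothing was open)
avg_age = pd.Series(np.nanmean(np.where(open_mask, age_days, np.nan), axis=1),
                    index=ref_dates, name='Avg_age')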
I am currently working on a course in Data Science on how to win data science competitions. The final project is a Kaggle competition that we have to participate in.
My training dataset has close to 3 million rows, and one of the columns is a "date of purchase" column.
I want to calculate the distance of each date to the nearest public holiday.
E.g. if the date is 31/12/2014, the nearest PH would be 01/01/2015. The number of days apart would be "1".
I cannot think of an efficient way to do this operation. I have a list with a number of Timestamps, each one is a public holiday in Russia (the dataset is from Russia).
def dateDifference(target_date_raw):
    abs_deltas_from_target_date = np.subtract(russian_public_holidays, target_date_raw)
    abs_deltas_from_target_date = [i.days for i in abs_deltas_from_target_date if i.days >= 0]
    index_of_min_delta_from_target_date = np.min(abs_deltas_from_target_date)
    return index_of_min_delta_from_target_date
where 'russian_public_holidays' is the list of public holiday dates and 'target_date_raw' is the date for which I want to calculate distance to the nearest public holiday.
This is the code I use to create a new column in my DataFrame for the difference of dates.
training_data['closest_public_holiday'] = [dateDifference(i) for i in training_data['date']]
This code ran for nearly 25 minutes and showed no signs of completing, which is why I turn to you guys for help.
I understand that this is probably the least Pandorable way of doing things, but I couldn't really find a clean way of operating on a single column during my research. I saw a lot of people say that using the "apply" function on a single column is a bad way of doing things. I am very new to working with such large datasets, which is why clean and efficient practices seem to elude me for now. Please do let me know what would be the best way to tackle this!
Try this and see if it helps with the timing. I worry that it will take up too much memory; I don't have the data to test, so you can try.
import numpy as np
import pandas as pd

# Sample data: month-end dates for 2021 and three assumed holidays
# (1/1/2021, 8/9/2021, 12/25/2021)
df = pd.DataFrame(pd.date_range('01/01/2021', '12/31/2021', freq='M'), columns=['Date'])
holidays = pd.to_datetime(np.array(['1/1/2021', '12/25/2021', '8/9/2021'])).to_numpy()

# Broadcast (n_dates, 1) - (n_holidays,) to get every pairwise difference,
# take the minimum absolute difference per row, and convert to days
df['Days Away'] = (
    np.min(np.absolute(df.Date.to_numpy().reshape(-1, 1) - holidays), axis=1)
    / np.timedelta64(1, 'D')
)
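If memory does become a problem with the full (n_rows x n_holidays) broadcast, a hedged alternative sketch is to sort the holidays and use np.searchsorted so each date is only compared against its two neighbouring holidays (variable names follow the snippet above):
dates = df['Date'].to_numpy()
hols = np.sort(holidays)

# Position of the first holiday >= each date, then clip to get both neighbours
idx = np.searchsorted(hols, dates)
prev_idx = np.clip(idx - 1, 0, len(hols) - 1)
next_idx = np.clip(idx, 0, len(hols) - 1)

to_prev = np.abs(dates - hols[prev_idx]) / np.timedelta64(1, 'D')
to_next = np.abs(hols[next_idx] - dates) / np.timedelta64(1, 'D')
df['Days Away'] = np.minimum(to_prev, to_next)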
I'm currently working with CESM Large Ensemble data on the cloud (ala https://medium.com/pangeo/cesm-lens-on-aws-4e2a996397a1) using xarray and Dask and am trying to plot the trends in extreme precipitation in each season over the historical period (Dec-Jan-Feb and Jun-Jul-Aug specifically).
Eg. If one had a daily time-series data split into months like:
1920: J,F,M,A,M,J,J,A,S,O,N,D
1921: J,F,M,A,M,J,J,A,S,O,N,D
...
My aim is to group together the JJA days in each year and then take the maximum value within that group of days for each year. Ditto for DJF, however here you have to be careful because DJF is a year-skipping season; the most natural way to define it is 1921's DJF = 1920 D + 1921 JF.
Using iris this would be simple (though quite inefficient), as you could just add auxiliary time coordinates for season and season_year and then aggregate/groupby those two coordinates and take a maximum (see the sketch below). This would give you a (year, lat, lon) output where each year contains the maximum of the precipitation field in the chosen season (e.g. maximum DJF precip in 1921 in each lat, lon pixel).
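For reference, a minimal sketch of that iris approach (hedged: it assumes cube is an iris Cube of daily precipitation with a 'time' coordinate):
import iris
import iris.analysis
import iris.coord_categorisation as ccat

# Auxiliary coordinates 'season' (djf/mam/jja/son) and 'season_year';
# add_season_year assigns December to the following year's season,
# which handles DJF's year-skipping behaviour
ccat.add_season(cube, 'time', name='season')
ccat.add_season_year(cube, 'time', name='season_year')

# Seasonal maximum per (season_year, lat, lon), then pull out DJF
seasonal_max = cube.aggregated_by(['season', 'season_year'], iris.analysis.MAX)
djf_max = seasonal_max.extract(iris.Constraint(season='djf'))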
However in xarray this operation is not as natural because you can't natively groupby multiple coordinates, see https://github.com/pydata/xarray/issues/324 for further info on this. However, in this github issue someone suggests a simple, nested workaround to the problem using xarray's .apply() functionality:
def nested_groupby_apply(dataarray, groupby, apply_fn):
    if len(groupby) == 1:
        return dataarray.groupby(groupby[0]).apply(apply_fn)
    else:
        return dataarray.groupby(groupby[0]).apply(nested_groupby_apply, groupby=groupby[1:], apply_fn=apply_fn)
I'd be quite keen to try and use this workaround myself, but I have two main questions beforehand:
1) I can't seem to work out how to group by coordinates in a way that avoids taking the DJF maximum within a single calendar year.
Eg. If one simply applies the function like (for a suitable xr_max() function):
outp = nested_groupby_apply(daily_prect, ['time.season', 'time.year'], xr_max)
outp_djf = outp.sel(season='DJF')
Then you effectively define 1921's DJF as 1921 D + 1921 JF, which isn't actually what you want to look at! This is because the 'time.year' grouping doesn't account for the year-skipping behaviour of seasons like DJF. I'm not sure how to workaround this?
2) This nested groupby function is incredibly slow! As such, I was wondering if anyone in the community had found a more efficient solution to this problem, with similar functionality?
Thanks ahead of time for your help, all! Let me know if anything needs clarifying.
EDIT: Since posting this, I've discovered there already is a workaround for this in the specific case of taking DJF/JJA means each year (Take maximum rainfall value for each season over a time period (xarray)), however I'm keeping this question open because the general problem of an efficient workaround for multi-coord grouping is still unsolved.
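As a pointer for that specific DJF/JJA case, here is a hedged sketch of that kind of resample-based workaround (assuming daily_prect from above, with a 'time' coordinate):
# Quarters anchored to December: the block labelled 1920-12-01 contains
# Dec 1920 + Jan/Feb 1921, i.e. the "1921 DJF" as defined above
seasonal_max = daily_prect.resample(time='QS-DEC').max()

# Keep only the quarters that start in December (the DJF ones)
djf_max = seasonal_max.where(seasonal_max['time.month'] == 12, drop=True)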
I'm working on porting an Excel financial model to Python Pandas. By financial model I mean forecasting a cash flow, profit & loss statement and balance sheet over time for a business venture, as opposed to pricing swaps/options or working with stock price data, which are also referred to as financial models. It's quite possible that the same concepts and issues apply to the latter types; I just don't know them that well, so can't comment.
So far I like a lot of what I see. The models I work with in Excel have a common time series across the top of the page, defining the time period we're interested in forecasting. Calculations then run down the page as a series of rows. Each row is therefore a TimeSeries object, or a collection of rows becomes a DataFrame. Obviously you need to transpose to read between these two constructs but this is a trivial transformation.
Better yet each Excel row should have a common, single formula and only be based on rows above on the page. This lends itself to vector operations that are computationally fast and simple to write using Pandas.
The issue I get is when I try to model a corkscrew-type calculation. These are often used to model accounting balances, where the opening balance for one period is the closing balance of the prior period. You can't use a .shift() operation as the closing balance in a given period depends, amongst other things, on the opening balance in the same period. This is probably best illustrated with an example:
Time 2013-04-01 2013-05-01 2013-06-01 2013-07-01 ...
Opening Balance 0 +3 -2 -10
[...]
Some Operations +3 -5 -8 +20
[...]
Closing Balance +3 -2 -10 +10
In pseudo-code, my solution for calculating these sorts of things is as follows. It is not a vectorised solution and it looks like it is pretty slow:
# Set up date range
dates = pd.date_range('2012-04-01', periods=500, freq='MS')

# Initialise empty lists
lOB = []
lSomeOp1 = []
lSomeOp2 = []
lCB = []

# Set the closing balance for the initial loop's OB
sCB = 0

# As this is a corkscrew calculation, we need to loop through all dates
for d in dates:
    # Create a datetime object as we reference it several times below
    dt = d.to_datetime()
    # Opening balance is either the initial opening balance if at the
    # initial date, or else the last closing balance from the prior period
    sOB = inp['ob'] if (dt == obDate) else sCB
    # Calculate some additions, write-offs, amortisation, depreciation, whatever!
    sSomeOp1 = 10
    sSomeOp2 = -sOB / 2
    # Calculate the closing balance
    sCB = sOB + sSomeOp1 + sSomeOp2
    # Build up lists of outputs
    lOB.append(sOB)
    lSomeOp1.append(sSomeOp1)
    lSomeOp2.append(sSomeOp2)
    lCB.append(sCB)

# Convert lists to time series objects
ob = pd.Series(lOB, index=dates)
someOp1 = pd.Series(lSomeOp1, index=dates)
someOp2 = pd.Series(lSomeOp2, index=dates)
cb = pd.Series(lCB, index=dates)
I can see that where you only have one or two lines of operations there might be some clever hacks to vectorise the computation; I'd be grateful to hear any tips people have on doing these sorts of tricks.
Some of the corkscrews I have to build, however, have hundreds of intermediate operations. In these cases, what's my best way forward? Is it to accept the slow performance of Python? Should I migrate to Cython? I've not really looked into it (so could be way off base), but the issue with the latter approach is that if I'm moving hundreds of lines into C, why am I bothering with Python in the first place? It doesn't feel like a simple lift and shift.
The following makes in-place updates, which should improve performance:
import numpy as np
import pandas as pd

book = pd.DataFrame([[0, 3, np.nan], [np.nan, -5, np.nan], [np.nan, -8, np.nan], [np.nan, 20, np.nan]],
                    columns=['ob', 'so', 'cb'],
                    index=['2013-04-01', '2013-05-01', '2013-06-01', '2013-07-01'])

for i, row in enumerate(book.index[:-1]):
    book.loc[row, 'cb'] = book.loc[row, ['ob', 'so']].sum()            # close the period
    book.iloc[i + 1, book.columns.get_loc('ob')] = book.loc[row, 'cb']  # carry forward
book.loc[book.index[-1], 'cb'] = book.loc[book.index[-1], ['ob', 'so']].sum()
book
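As a follow-up on the vectorisation question: when every intermediate operation is linear in the opening balance (as in the pseudo-code example above, where cb = ob/2 + 10), the corkscrew is a first-order linear recurrence and can be evaluated without a Python loop. A hedged sketch using scipy.signal.lfilter, with coefficients taken from that example:
import numpy as np
from scipy.signal import lfilter

a, b = 0.5, 10.0      # cb_t = a * ob_t + b, with ob_{t+1} = cb_t
n, ob0 = 500, 0.0     # number of periods and initial opening balance

# cb_t = a * cb_{t-1} + b is an IIR filter driven by a constant input b;
# the initial filter state zi seeds the recurrence with cb_{-1} = ob0
cb = lfilter([1.0], [1.0, -a], np.full(n, b), zi=[a * ob0])[0]
ob = np.concatenate(([ob0], cb[:-1]))
This only works if each row stays linear in the balance; otherwise the loop (or something like Cython, as the question mentions) remains the pragmatic route.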