This question is a follow-up that came up in the comments of Resampling on a multi index.
We start with following data:
import pandas as pd

data = pd.DataFrame({'dates': ['2004', '2008', '2012'],
                     'values': [k * (1 + 4 * 365) for k in range(3)]})
data['dates'] = pd.to_datetime(data['dates'])
data = data.set_index('dates')
This is what it produces:
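values
dates
2004-01-01 0
2008-01-01 1461
2012-01-01 2922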
Now, when I resample and interpolate by
data.resample('A').mean().interpolate()
I obtain the following:
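values
dates
2004-12-31 0.00
2005-12-31 365.25
2006-12-31 730.50
2007-12-31 1095.75
2008-12-31 1461.00
2009-12-31 1826.25
2010-12-31 2191.50
2011-12-31 2556.75
2012-12-31 2922.00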
But what I want (and the problem is already the resampling and not the interpolation step) is
2004-12-31 365
2005-12-31 730
2006-12-31 1095
2007-12-31 1460
2008-12-31 1826
2009-12-31 2191
2010-12-31 2556
2011-12-31 2921
2012-12-31 3287
So I want an actual linear interpolation on the given data.
To make it even clearer, I wrote a function which does the job. However, I'm still looking for a built-in solution (my own function is bad code because of its very ugly runtime):
def fillResampleCorrectly(data, resample):
    for i in range(len(resample)):
        currentDate = resample.index[i]
        # Find the pair of original data points that bracket currentDate
        for j in range(len(data)):
            if currentDate >= data.index[j]:
                if j < len(data) - 1:
                    continue
            valueBefore = data[data.columns[0]].iloc[j - 1]
            valueAfter = data[data.columns[0]].iloc[j]
            dateBefore = data.index[j - 1]
            dateAfter = data.index[j]
            # Linear interpolation weighted by the actual time between the two dates
            currentValue = valueBefore + (valueAfter - valueBefore) * ((currentDate - dateBefore) / (dateAfter - dateBefore))
            resample[data.columns[0]].iloc[i] = currentValue
            break
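As a usage sketch (not part of the question itself), the helper would be called on a pre-built yearly skeleton; note that the chained .iloc assignment inside it may raise a SettingWithCopyWarning on recent pandas versions:

# Hypothetical usage of the helper above: build the yearly skeleton, then fill it in
resampled = data.resample('A').mean()
fillResampleCorrectly(data, resampled)
print(resampled)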
I can't find a direct way to get your exact output. The issue is the resampling between 01-01 and 31-12 of the first year.
You can however mimic the result with:
out = data.resample('A', label='right').mean().interpolate(method='time') + 365
Or:
s = data.resample('A', label='right').mean().interpolate(method='time')
out = s + (s.index[0] - data.index[0]).days
Output:
values
dates
2004-12-31 365.0
2005-12-31 730.0
2006-12-31 1095.0
2007-12-31 1460.0
2008-12-31 1826.0
2009-12-31 2191.0
2010-12-31 2556.0
2011-12-31 2921.0
2012-12-31 3287.0
What is “actual” interpolation? You are considering leap years, which makes this a non-linear relationship.
Generating a df that starts with the end of the year (and accounts for 2004 as a leap year):
data = pd.DataFrame({'dates': ['2004-12-31', '2008-12-31', '2012-12-31'],
                     'values': [366 + k * (1 + 4 * 365) for k in range(3)]})
data['dates'] = pd.to_datetime(data['dates'])
data = data.set_index('dates')
values
dates
2004-12-31 366
2008-12-31 1827
2012-12-31 3288
Resample and interpolate as before (data = data.resample('A').mean().interpolate()). By the way, A in resample means end of year, and AS means start of year.
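A minimal illustration of the two aliases (using a synthetic one-row frame, not from the answer):

tmp = pd.DataFrame({'values': [1]}, index=pd.to_datetime(['2004-06-15']))
tmp.resample('A').mean().index[0]   # Timestamp('2004-12-31'): bins labelled at year end
tmp.resample('AS').mean().index[0]  # Timestamp('2004-01-01'): bins labelled at year start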
If we look at the difference between each step (data - data.shift(1)), we get:
values
dates
2004-12-31 NaN
2005-12-31 365.25
2006-12-31 365.25
2007-12-31 365.25
2008-12-31 365.25
2009-12-31 365.25
2010-12-31 365.25
2011-12-31 365.25
2012-12-31 365.25
As we would expect from a linear interpolation.
The desired result can be achieved by applying np.floor to the results:
data.resample('A').mean().interpolate().apply(np.floor)
values
dates
2004-12-31 366.0
2005-12-31 731.0
2006-12-31 1096.0
2007-12-31 1461.0
2008-12-31 1827.0
2009-12-31 2192.0
2010-12-31 2557.0
2011-12-31 2922.0
2012-12-31 3288.0
And the difference data - data.shift(1):
values
dates
2004-12-31 NaN
2005-12-31 365.0
2006-12-31 365.0
2007-12-31 365.0
2008-12-31 366.0
2009-12-31 365.0
2010-12-31 365.0
2011-12-31 365.0
2012-12-31 366.0
A non-linear relationship caused by the leap year.
I just came up with an idea and it works:
dailyData = data.asfreq('D').interpolate()
dailyData.groupby(dailyData.index.year).tail(1)
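With the example data this gives:
values
dates
2004-12-31 365.0
2005-12-31 730.0
2006-12-31 1095.0
2007-12-31 1460.0
2008-12-31 1826.0
2009-12-31 2191.0
2010-12-31 2556.0
2011-12-31 2921.0
2012-01-01 2922.0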
Only for the last year is the wrong date chosen, but that is completely fine for me. The important thing is that the days match the values.
Related
Is there a way to calculate a rolling mean on a descending time series without sorting it into an ascending one?
Original time series, with the same timestamp order as in the CSV file:
pd.read_csv(data_dir+items+extension, parse_dates=True, index_col='timestamp').sort_index(ascending=False)
timestamp open
2021-05-06 90.000
2021-05-05 93.600
2021-05-04 90.840
2021-05-03 91.700
2021-04-30 91.355
Rolling mean
stock_dict[items]["SMA100"]=pd.Series(stock_dict[items]["close"]).rolling(window=100).mean()
ascending = False
open high low close volume SMA100
timestamp
2021-05-06 90.000 93.5200 89.64 93.03 8024053 NaN
2021-05-05 93.600 94.7700 90.00 90.08 13079308 NaN
2021-05-04 90.840 90.9700 87.44 88.69 15147509 NaN
2021-05-03 91.700 92.0200 90.79 91.15 6641764 NaN
2021-04-30 91.355 91.9868 90.89 91.19 6614347 NaN
... ... ... ... ... ... ...
1999-11-05 14.560 15.5000 14.50 15.38 1308267 14.9245
1999-11-04 14.690 14.7500 14.25 14.62 207033 14.9395
1999-11-03 14.310 14.5000 14.12 14.50 61600 14.9526
1999-11-02 14.250 15.0000 14.16 14.25 128817 14.9639
1999-11-01 14.190 14.3800 13.94 14.06 173233 14.9682
ascending = True
open high low close volume SMA100
timestamp
1999-11-01 14.190 14.3800 13.94 14.06 173233 NaN
1999-11-02 14.250 15.0000 14.16 14.25 128817 NaN
1999-11-03 14.310 14.5000 14.12 14.50 61600 NaN
1999-11-04 14.690 14.7500 14.25 14.62 207033 NaN
1999-11-05 14.560 15.5000 14.50 15.38 1308267 NaN
... ... ... ... ... ... ...
2021-04-30 91.355 91.9868 90.89 91.19 6614347 93.1148
2021-05-03 91.700 92.0200 90.79 91.15 6641764 93.2036
2021-05-04 90.840 90.9700 87.44 88.69 15147509 93.2542
2021-05-05 93.600 94.7700 90.00 90.08 13079308 93.3292
2021-05-06 90.000 93.5200 89.64 93.03 8024053 93.4284
As the time series goes from 1999 to 2021, the rolling mean is correct in the case of ascending = True.
So either I have to change the sorting of the data, which I would like to avoid, or I have to somehow tell the rolling mean function to start with the last entry and calculate backwards.
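One possible approach (not from this thread, just a sketch): compute the rolling mean on a reversed view of the column and let index alignment write it back onto the descending frame.

import pandas as pd
import numpy as np

# Synthetic stand-in for the CSV data, already sorted in descending timestamp order
idx = pd.date_range('1999-11-01', periods=300, freq='B')[::-1]
df = pd.DataFrame({'close': np.linspace(93, 14, 300)}, index=idx)

# Reverse to ascending time, compute the trailing SMA, then assign back;
# assignment aligns on the index, so the descending row order is preserved
df['SMA100'] = df['close'][::-1].rolling(window=100).mean()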
Consider two time series as pandas.Series:
tser_a:
date
2016-05-25 13:30:00.023 50.41
2016-05-26 13:30:00.023 51.96
2016-05-27 13:30:00.030 51.98
2016-05-28 13:30:00.041 52.00
2016-05-29 13:30:00.048 52.01
2016-06-02 13:30:00.049 51.97
2016-06-03 13:30:00.072 52.01
2016-06-04 13:30:00.075 52.10
tser_b:
date
2016-05-24 13:30:00.023 74.41
2016-05-25 13:30:00.023 74.96
2016-05-26 13:30:00.030 74.98
2016-05-27 13:30:00.041 73.00
2016-05-28 13:30:00.048 73.01
2016-05-29 13:30:00.049 73.97
2016-06-02 13:30:00.072 72.01
2016-06-03 13:30:00.075 72.10
I would like to calculate the correlation between these two timeseries.
Pandas does offer the pandas.Series.corr (ref) function to compute such a value.
corr = tser_a.corr(tser_b)
My doubt:
However, I need to be sure that the correlation takes into account the exact same date for each value, thus considering only the intersection between tser_a and tser_b.
As pseudocode:
if ((tser_a[date_x] IS NOT NIL) AND (tser_b[date_x] IS NOT NIL)):
    then: consider(tser_a[date_x], tser_b[date_x])
else:
    then: skip and go ahead
Then:
tser_b -> 2016-05-24 13:30:00.023 74.41
tser_a -> 2016-06-04 13:30:00.075 52.10
must be excluded.
Does pandas.Series.corr assume this behaviour by default, or should I first intersect the two timeseries according to the date?
It looks like tser_a.corr(tser_b) does match the indices. However, since the two series might not have exactly the same timestamps, you could get an unexpected outcome. Instead, you can resample first:
tser_a.resample('D').mean().corr(tser_b.resample('D').mean())
# out -0.5522781562573792
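If you prefer to make the intersection explicit rather than relying on corr's index alignment, you can join on the index first; a sketch with small synthetic series standing in for tser_a and tser_b:

import pandas as pd

tser_a = pd.Series([50.41, 51.96, 51.98, 52.00],
                   index=pd.to_datetime(['2016-05-25', '2016-05-26',
                                         '2016-05-27', '2016-05-28']))
tser_b = pd.Series([74.96, 74.98, 73.00, 73.01],
                   index=pd.to_datetime(['2016-05-26', '2016-05-27',
                                         '2016-05-28', '2016-05-29']))

# Keep only timestamps present in both series, then correlate the aligned values
aligned = pd.concat([tser_a, tser_b], axis=1, join='inner').dropna()
corr = aligned.iloc[:, 0].corr(aligned.iloc[:, 1])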
I'm curious as to what last() and first() do in this specific instance (when chained to a resampling). Correct me if I'm wrong, but I understand that if you pass an argument into first or last, e.g. 3, it returns the first 3 months or first 3 years.
In this circumstance, since I'm not passing any arguments into first() and last(), what are they actually doing when I resample like that? I know that if I resample by chaining .mean(), I'll resample into years with the mean score from averaging all the months, but what is happening when I'm using last()?
More importantly, why do first() and last() give me different answers in this context? I see that numerically they are not equal.
i.e.: post2008.resample().first() != post2008.resample().last()
TLDR:
What do .first() and .last() do?
What do .first() and .last() do in this instance, when chained to a resample?
Why does .resample().first() != .resample().last()?
This is the code before the aggregation:
# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv('GDP.csv', index_col='DATE', parse_dates=True)
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':,:]
# Print the last 8 rows of post2008
print(post2008.tail(8))
This is what print(post2008.tail(8)) outputs:
VALUE
DATE
2014-07-01 17569.4
2014-10-01 17692.2
2015-01-01 17783.6
2015-04-01 17998.3
2015-07-01 18141.9
2015-10-01 18222.8
2016-01-01 18281.6
2016-04-01 18436.5
Here is the code that resamples and aggregates by last():
# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()
print(yearly)
This is what yearly is like when it's post2008.resample('A').last():
VALUE
DATE
2008-12-31 14549.9
2009-12-31 14566.5
2010-12-31 15230.2
2011-12-31 15785.3
2012-12-31 16297.3
2013-12-31 16999.9
2014-12-31 17692.2
2015-12-31 18222.8
2016-12-31 18436.5
Here is the code that resamples and aggregates by first():
# Resample post2008 by year, keeping first(): yearly
yearly = post2008.resample('A').first()
print(yearly)
This is what yearly is like when it's post2008.resample('A').first():
VALUE
DATE
2008-12-31 14668.4
2009-12-31 14383.9
2010-12-31 14681.1
2011-12-31 15238.4
2012-12-31 15973.9
2013-12-31 16475.4
2014-12-31 17025.2
2015-12-31 17783.6
2016-12-31 18281.6
Before anything else, let's create a dataframe with example data:
import pandas as pd
dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
                          '2015-04-01', '2015-07-01', '2015-10-01',
                          '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)
print(df)
The output will be
VALUE
2014-07-01 1000
2014-10-01 2000
2015-01-01 3000
2015-04-01 4000
2015-07-01 5000
2015-10-01 6000
2016-01-01 7000
2016-04-01 8000
If we pass e.g. '6M' to df.first (which is not an aggregator, but a DataFrame method), we will be selecting the first six months of data, which in the example above contain just two rows:
print(df.first('6M'))
VALUE
2014-07-01 1000
2014-10-01 2000
Similarly, last returns only the rows that belong to the last six months of data:
print(df.last('6M'))
VALUE
2016-01-01 7000
2016-04-01 8000
In this context, not passing the required argument results in an error:
print(df.first())
TypeError: first() missing 1 required positional argument: 'offset'
On the other hand, df.resample('Y') returns a Resampler object, which has aggregation methods first, last, mean, etc. In this case, they keep only the first (respectively, last) values of each year (instead of e.g. averaging all values, or some other kind of aggregation):
print(df.resample('Y').first())
VALUE
2014-12-31 1000
2015-12-31 3000 # This is the first of the 4 values from 2015
2016-12-31 7000
print(df.resample('Y').last())
VALUE
2014-12-31 2000
2015-12-31 6000 # This is the last of the 4 values from 2015
2016-12-31 8000
As an extra example, consider also the case of grouping by a smaller period:
print(df.resample('M').last().head())
VALUE
2014-07-31 1000.0 # This is the last (and only) value from July, 2014
2014-08-31 NaN # No data for August, 2014
2014-09-30 NaN # No data for September, 2014
2014-10-31 2000.0
2014-11-30 NaN # No data for November, 2014
In this case, any periods for which there is no value will be filled with NaNs. Also, for this example, using first instead of last would have returned the same values, since each month has (at most) one value.
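A quick way to confirm that with the example frame (where no month has more than one row):

print(df.resample('M').first().equals(df.resample('M').last()))  # True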
Can someone explain what is going on with my resampling?
For example,
In [53]: daily_3mo_treasury.resample('5Y').mean()
Out[53]:
1993-12-31 2.997120
1998-12-31 4.917730
2003-12-31 3.297176
2008-12-31 2.997204
2013-12-31 0.097330
2018-12-31 0.534476
Where the last date in my time series is 2018-08-23 2.04
I really want my resample from the most recent year-end instead, so for example from 2017-12-31 to 2012-12-31 and so on.
I tried,
end = daily_3mo_treasury.index.searchsorted(date(2017,12,31))
daily_3mo_treasury.iloc[:end].resample('5Y').mean()
In [66]: daily_3mo_treasury.iloc[:end].resample('5Y').mean()
Out[66]:
1993-12-31 2.997120
1998-12-31 4.917730
2003-12-31 3.297176
2008-12-31 2.997204
2013-12-31 0.097330
2018-12-31 0.333467
dtype: float64
Where the last value in daily_3mo_treasury.iloc[:end] is 2017-12-29 1.37
How come my second 5 year resample is not ending 2017-12-31?
Edit: My index is sorted.
From @ALollz - When you resample, the bins are based on the first date in your index.
sistart = daily_3mo_treasury.index.searchsorted(date(1992,12,31))
siend = daily_3mo_treasury.index.searchsorted(date(2017,12,31))
In [95]: daily_3mo_treasury.iloc[sistart:siend].resample('5Y').mean()
Out[95]:
1992-12-31 3.080000
1997-12-31 4.562246
2002-12-31 4.050696
2007-12-31 2.925971
2012-12-31 0.360775
2017-12-31 0.278233
dtype: float64
How do I resample a dataframe with a daily time-series index to yearly, but not from 1 Jan to 31 Dec? Instead, I want the yearly sum from 1 June to 31 May.
First I did this, which gives me the yearly sum from 1 Jan to 31 Dec:
df.resample(rule='A').sum()
I have tried using the base parameter, but it does not change the resampled sum.
df.resample(rule='A', base=100).sum()
Here is a part of my dataframe:
In []: df
Out[]:
Index ET P R
2010-01-01 00:00:00 -0.013 0.0 0.773
2010-01-02 00:00:00 0.0737 0.21 0.797
2010-01-03 00:00:00 -0.048 0.0 0.926
...
In []: df.resample(rule='A', base = 0, label='left').sum()
Out []:
Index
2009-12-31 00:00:00 424.131138 871.48 541.677405
2010-12-31 00:00:00 405.625780 939.06 575.163096
2011-12-31 00:00:00 461.586365 1064.82 710.507947
...
I would really appreciate it if anyone could help me figure out how to do this. Thank you.
Use 'AS-JUN' as the rule with resample:
# Example data
idx = pd.date_range('2017-01-01', '2018-12-31')
s = pd.Series(1, idx)
# Resample
s = s.resample('AS-JUN').sum()
The resulting output:
2016-06-01 151
2017-06-01 365
2018-06-01 214
Freq: AS-JUN, dtype: int64
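Applied to the dataframe from the question, the same rule should give the June-to-May yearly sums directly:

out = df.resample('AS-JUN').sum()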