Can someone explain what is going on with my resampling?
For example,
In [53]: daily_3mo_treasury.resample('5Y').mean()
Out[53]:
1993-12-31 2.997120
1998-12-31 4.917730
2003-12-31 3.297176
2008-12-31 2.997204
2013-12-31 0.097330
2018-12-31 0.534476
Where the last date in my time series is 2018-08-23 2.04
I really want my resample anchored to the most recent year-end instead, so, for example, from 2017-12-31 back to 2012-12-31, and so on.
I tried,
end = daily_3mo_treasury.index.searchsorted(date(2017,12,31))
daily_3mo_treasury.iloc[:end].resample('5Y').mean()
In [66]: daily_3mo_treasury.iloc[:end].resample('5Y').mean()
Out[66]:
1993-12-31 2.997120
1998-12-31 4.917730
2003-12-31 3.297176
2008-12-31 2.997204
2013-12-31 0.097330
2018-12-31 0.333467
dtype: float64
Where the last value in daily_3mo_treasury.iloc[:end] is 2017-12-29 1.37
How come my second 5-year resample does not end on 2017-12-31?
Edit: My index is sorted.
From @ALollz: when you resample, the bins are based on the first date in your index.
sistart = daily_3mo_treasury.index.searchsorted(date(1992,12,31))
siend = daily_3mo_treasury.index.searchsorted(date(2017,12,31))
In [95]: daily_3mo_treasury.iloc[sistart:siend].resample('5Y').mean()
Out[95]:
1992-12-31 3.080000
1997-12-31 4.562246
2002-12-31 4.050696
2007-12-31 2.925971
2012-12-31 0.360775
2017-12-31 0.278233
dtype: float64
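To see that anchoring effect in isolation, here is a minimal sketch with synthetic data (not the actual treasury series): two series with identical values that differ only in their first date get differently anchored 5-year bins.
import pandas as pd
# Same values, different start dates (synthetic stand-ins for daily_3mo_treasury)
a = pd.Series(1.0, index=pd.date_range('1993-01-04', '2017-12-29', freq='D'))
b = pd.Series(1.0, index=pd.date_range('1992-12-31', '2017-12-29', freq='D'))
print(a.resample('5Y').mean().index.max())  # 2018-12-31: bins anchored off 1993
print(b.resample('5Y').mean().index.max())  # 2017-12-31: bins anchored off 1992-12-31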
Related
This question is a follow-up that arose in the comments of Resampling on a multi index.
We start with following data:
import pandas as pd
data = pd.DataFrame({'dates': ['2004', '2008', '2012'],
                     'values': [k * (1 + 4 * 365) for k in range(3)]})
data['dates'] = pd.to_datetime(data['dates'])
data = data.set_index('dates')
This is what it produces:
values
dates
2004-01-01 0
2008-01-01 1461
2012-01-01 2922
Now, when I resample and interpolate by
data.resample('A').mean().interpolate()
I obtain the following:
values
dates
2004-12-31 0.00
2005-12-31 365.25
2006-12-31 730.50
2007-12-31 1095.75
2008-12-31 1461.00
2009-12-31 1826.25
2010-12-31 2191.50
2011-12-31 2556.75
2012-12-31 2922.00
But what I want (and the problem is already the resampling and not the interpolation step) is
2004-12-31 365
2005-12-31 730
2006-12-31 1095
2007-12-31 1460
2008-12-31 1826
2009-12-31 2191
2010-12-31 2556
2011-12-31 2921
2012-12-31 3287
So I want an actual linear interpolation on the given data.
To make it even clearer, I wrote a function which does the job. However, I'm still looking for a built-in solution (my own function is poor code because of a very ugly runtime):
def fillResampleCorrectly(data, resample):
    for i in range(len(resample)):
        currentDate = resample.index[i]
        for j in range(len(data)):
            # Skip forward while currentDate is still past data.index[j],
            # unless j is already the last data point
            if currentDate >= data.index[j]:
                if j < len(data) - 1:
                    continue
            # Interpolate linearly in time between data points j-1 and j
            valueBefore = data[data.columns[0]].iloc[j - 1]
            valueAfter = data[data.columns[0]].iloc[j]
            dateBefore = data.index[j - 1]
            dateAfter = data.index[j]
            currentValue = valueBefore + (valueAfter - valueBefore) * ((currentDate - dateBefore) / (dateAfter - dateBefore))
            resample[data.columns[0]].iloc[i] = currentValue
            break
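As an aside, the same linear-in-time interpolation can be written in vectorized form with np.interp on the integer (nanosecond) representation of the timestamps, avoiding the double loop. A sketch (the function name is illustrative, not from the original post):
import numpy as np
def fill_resample_vectorized(data, resample):
    col = data.columns[0]
    # Timestamps as int64 nanoseconds; float rounding is negligible at daily scale
    x = resample.index.astype('int64')
    xp = data.index.astype('int64')
    fp = data[col].to_numpy(dtype=float)
    out = resample.copy()
    # Note: unlike the loop above, np.interp clamps outside the data range
    # instead of extrapolating
    out[col] = np.interp(x, xp, fp)
    return out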
I don't see a direct way to get your exact output. The issue is the resampling between 01-01 and 12-31 of the first year.
You can, however, mimic the result with:
out = data.resample('A', label='right').mean().interpolate(method='time') + 365
Or:
s = data.resample('A', label='right').mean().interpolate(method='time')
out = s + (s.index[0] - data.index[0]).days
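Here (s.index[0] - data.index[0]).days equals 365: the gap between the first data point (2004-01-01) and the first bin label (2004-12-31). That is why the two variants agree.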
Output:
values
dates
2004-12-31 365.0
2005-12-31 730.0
2006-12-31 1095.0
2007-12-31 1460.0
2008-12-31 1826.0
2009-12-31 2191.0
2010-12-31 2556.0
2011-12-31 2921.0
2012-12-31 3287.0
What is “actual” interpolation? You are considering leap years, which makes this a non-linear relationship.
Generating a df that starts with the end of the year (and accounts for 2004 as a leap year):
data = pd.DataFrame({'dates': ['2004-12-31', '2008-12-31', '2012-12-31'],
                     'values': [366 + k * (1 + 4 * 365) for k in range(3)]})
data['dates'] = pd.to_datetime(data['dates'])
data = data.set_index('dates')
values
dates
2004-12-31 366
2008-12-31 1827
2012-12-31 3288
Resample and interpolate as before (data = data.resample('A').mean().interpolate()). By the way, 'A' in resample means year end, and 'AS' means year start.
If we look at the difference between each step (data - data.shift(1)), we get:
values
dates
2004-12-31 NaN
2005-12-31 365.25
2006-12-31 365.25
2007-12-31 365.25
2008-12-31 365.25
2009-12-31 365.25
2010-12-31 365.25
2011-12-31 365.25
2012-12-31 365.25
As we would expect from a linear interpolation.
The desired result can be achieved by applying np.floor to the results:
data.resample('A').mean().interpolate().apply(np.floor)
values
dates
2004-12-31 366.0
2005-12-31 731.0
2006-12-31 1096.0
2007-12-31 1461.0
2008-12-31 1827.0
2009-12-31 2192.0
2010-12-31 2557.0
2011-12-31 2922.0
2012-12-31 3288.0
And the difference data - data.shift(1):
values
dates
2004-12-31 NaN
2005-12-31 365.0
2006-12-31 365.0
2007-12-31 365.0
2008-12-31 366.0
2009-12-31 365.0
2010-12-31 365.0
2011-12-31 365.0
2012-12-31 366.0
A non-linear relationship caused by the leap year.
I just came up with an idea and it works:
dailyData = data.asfreq('D').interpolate()
dailyData.groupby(dailyData.index.year).tail(1)
Only for the last year is the wrong date chosen, but that is completely fine for me. The important thing is that the days match the values.
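A related approach that skips the daily upsample (a sketch against the same data as above; the names year_ends and out are illustrative): build the year-end grid explicitly, union it into the index, time-interpolate, and select the grid back out.
import pandas as pd
# Year ends between the first and last data points (2012-12-31 falls outside)
year_ends = pd.date_range(data.index.min(), data.index.max(), freq='A')
out = (data.reindex(data.index.union(year_ends))
           .interpolate(method='time')
           .loc[year_ends])
On the sample data this gives 365.0, 730.0, ..., 2921.0 for 2004-12-31 through 2011-12-31; like the asfreq version, the final year-end lies beyond the last data point and is not produced.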
I'm curious as to what last() and first() do in this specific instance (when chained to a resample). Correct me if I'm wrong, but I understand that if you pass an argument into first() or last(), e.g. 3, it returns the first 3 months or the first 3 years.
In this circumstance, since I'm not passing any arguments into first() and last(), what are they actually doing when I resample like that? I know that if I resample by chaining .mean(), I'll resample into years with the mean score from averaging all the months, but what is happening when I use last()?
More importantly, why do first() and last() give me different answers in this context? I see that numerically they are not equal.
i.e. post2008.resample().first() != post2008.resample().last()
TLDR:
What do .first() and .last() do?
What do .first() and .last() do in this instance, when chained to a resample?
Why is .resample().first() != .resample().last()?
This is the code before the aggregation:
# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv('GDP.csv', index_col='DATE', parse_dates=True)
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':,:]
# Print the last 8 rows of post2008
print(post2008.tail(8))
This is what print(post2008.tail(8)) outputs:
VALUE
DATE
2014-07-01 17569.4
2014-10-01 17692.2
2015-01-01 17783.6
2015-04-01 17998.3
2015-07-01 18141.9
2015-10-01 18222.8
2016-01-01 18281.6
2016-04-01 18436.5
Here is the code that resamples and aggregates by last():
# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()
print(yearly)
This is what yearly is like when it's post2008.resample('A').last():
VALUE
DATE
2008-12-31 14549.9
2009-12-31 14566.5
2010-12-31 15230.2
2011-12-31 15785.3
2012-12-31 16297.3
2013-12-31 16999.9
2014-12-31 17692.2
2015-12-31 18222.8
2016-12-31 18436.5
Here is the code that resamples and aggregates by first():
# Resample post2008 by year, keeping first(): yearly
yearly = post2008.resample('A').first()
print(yearly)
This is what yearly is like when it's post2008.resample('A').first():
VALUE
DATE
2008-12-31 14668.4
2009-12-31 14383.9
2010-12-31 14681.1
2011-12-31 15238.4
2012-12-31 15973.9
2013-12-31 16475.4
2014-12-31 17025.2
2015-12-31 17783.6
2016-12-31 18281.6
Before anything else, let's create a dataframe with example data:
import pandas as pd
dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
                          '2015-04-01', '2015-07-01', '2015-10-01',
                          '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)
print(df)
The output will be
VALUE
2014-07-01 1000
2014-10-01 2000
2015-01-01 3000
2015-04-01 4000
2015-07-01 5000
2015-10-01 6000
2016-01-01 7000
2016-04-01 8000
If we pass e.g. '6M' to df.first (which is not an aggregator, but a DataFrame method), we will be selecting the first six months of data, which in the example above consists of just two days:
print(df.first('6M'))
VALUE
2014-07-01 1000
2014-10-01 2000
Similarly, last returns only the rows that belong to the last six months of data:
print(df.last('6M'))
VALUE
2016-01-01 7000
2016-04-01 8000
In this context, not passing the required argument results in an error:
print(df.first())
TypeError: first() missing 1 required positional argument: 'offset'
On the other hand, df.resample('Y') returns a Resampler object, which has aggregation methods first, last, mean, etc. In this case, they keep only the first (respectively, last) values of each year (instead of e.g. averaging all values, or some other kind of aggregation):
print(df.resample('Y').first())
VALUE
2014-12-31 1000
2015-12-31 3000 # This is the first of the 4 values from 2015
2016-12-31 7000
print(df.resample('Y').last())
VALUE
2014-12-31 2000
2015-12-31 6000 # This is the last of the 4 values from 2015
2016-12-31 8000
As an extra example, consider also the case of grouping by a smaller period:
print(df.resample('M').last().head())
VALUE
2014-07-31 1000.0 # This is the last (and only) value from July, 2014
2014-08-31 NaN # No data for August, 2014
2014-09-30 NaN # No data for September, 2014
2014-10-31 2000.0
2014-11-30 NaN # No data for November, 2014
In this case, any periods for which there is no value will be filled with NaNs. Also, for this example, using first instead of last would have returned the same values, since each month has (at most) one value.
I have a dataframe with a MultiIndex: the first level is a company ID and the second level is a timestamp. How can I get a rank of all companies based on their scores, every month?
Score
company_idx timestamp
10006 2010-01-31 69.875394
2010-11-30 73.640693
2010-12-31 73.286248
2011-01-31 73.660052
2011-02-28 74.615564
2011-03-31 73.535187
2011-04-30 72.491390
2012-01-31 72.162768
2012-02-29 61.637952
2012-03-31 59.445419
2012-04-30 25.685615
2012-05-31 8.047693
2012-06-30 58.341200
...
9981 2016-12-31 51.011261
2018-05-31 54.462832
2018-06-30 57.126250
2018-07-31 54.695835
2018-08-31 63.758145
2018-09-30 63.255583
2018-10-31 62.069697
2018-11-30 62.795650
2018-12-31 63.045329
2019-01-31 60.276990
2019-02-28 56.666379
2019-03-31 57.903213
2019-04-30 57.558973
2019-05-31 52.260287
I've tried to do:
df2 = df.sort_index(by='Score', ascending=False)
But it's not giving me what I want.
Would you be able to help? I'm quite new with multilevel dataframes.
Many thanks!
You should swap the index levels to have the month first, then sort by timestamp ascending and Score descending:
df.index = df.index.swaplevel()
df.sort_values(['timestamp', 'Score'], ascending=[True, False], inplace=True)
It does not give an interesting result with your sample data, because only one company has a Score value in any given month.
To extract the values for one month, you can use df.xs(month_value, level=0) that will drop one level in the multi-index, or df.xs(month_value, level=0, drop_level=False) that will keep it.
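If what you ultimately want is an explicit rank column per month, a groupby sketch (assuming the original index order, with company_idx first and timestamp second; the column name monthly_rank is illustrative):
# Rank companies within each month, highest Score first
df['monthly_rank'] = (df.groupby(level='timestamp')['Score']
                        .rank(ascending=False, method='dense'))
Each row then carries that company's rank among all companies for its month.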
How do I resample a dataframe with a daily time-series index to yearly, but not from 1 Jan to 31 Dec? Instead I want the yearly sum from 1 June to 31 May.
First I did this, which gives me the yearly sum from 1 Jan to 31 Dec:
df.resample(rule='A').sum()
I have tried using the base parameter, but it does not change the resampled sums.
df.resample(rule='A', base=100).sum()
Here is a part of my dataframe:
In []: df
Out[]:
Index ET P R
2010-01-01 00:00:00 -0.013 0.0 0.773
2010-01-02 00:00:00 0.0737 0.21 0.797
2010-01-03 00:00:00 -0.048 0.0 0.926
...
In []: df.resample(rule='A', base = 0, label='left').sum()
Out []:
Index
2009-12-31 00:00:00 424.131138 871.48 541.677405
2010-12-31 00:00:00 405.625780 939.06 575.163096
2011-12-31 00:00:00 461.586365 1064.82 710.507947
...
I would really appreciate it if anyone could help me figure out how to do this.
Thank you!
Use 'AS-JUN' as the rule with resample:
# Example data
idx = pd.date_range('2017-01-01', '2018-12-31')
s = pd.Series(1, idx)
# Resample
s = s.resample('AS-JUN').sum()
The resulting output:
2016-06-01 151
2017-06-01 365
2018-06-01 214
Freq: AS-JUN, dtype: int64
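Applied to the question's dataframe, the same anchor gives the June-through-May sums directly:
df.resample('AS-JUN').sum()
Each output row is labeled with the 1 June that starts its 12-month window.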
I have a dataframe as follows
df = pd.DataFrame({ 'X' : np.random.randn(50000)}, index=pd.date_range('1/1/2000', periods=50000, freq='T'))
df.head(10)
Out[37]:
X
2000-01-01 00:00:00 -0.699565
2000-01-01 00:01:00 -0.646129
2000-01-01 00:02:00 1.339314
2000-01-01 00:03:00 0.559563
2000-01-01 00:04:00 1.529063
2000-01-01 00:05:00 0.131740
2000-01-01 00:06:00 1.282263
2000-01-01 00:07:00 -1.003991
2000-01-01 00:08:00 -1.594918
2000-01-01 00:09:00 -0.775230
I would like to create a variable that contains the sum of X over the last 5 days (not including the current observation), only considering observations that fall at the exact same time of day (hour and minute) as the current observation.
In other words:
At index 2000-01-01 00:00:00, df['rolling_sum_same_hour'] contains the sum of the values of X observed at 00:00:00 during the last 5 days in the data (not including 2000-01-01, of course).
At index 2000-01-01 00:01:00, df['rolling_sum_same_hour'] contains the sum of the values of X observed at 00:01:00 during the last 5 days, and so on.
The intuitive idea is that intraday prices have intraday seasonality, and I want to get rid of it that way.
I tried to use df['rolling_sum_same_hour'] = df.at_time(df.index.minute).rolling(window=5).sum()
with no success.
Any ideas?
Many thanks!
Behold the power of groupby!
df = # as you defined above
df['rolling_sum_by_time'] = df.groupby(df.index.time)['X'].apply(lambda x: x.shift(1).rolling(10).sum())
It's a big pill to swallow, but we are grouping by time (as in Python datetime.time), then selecting the column we care about (otherwise apply would act on whole sub-frames; with the column selected it acts on the time-groups of that column), and then applying the function you want!
IIUC, what you want is to perform a rolling sum, but only on the observations grouped by the exact same time of day. This can be done by
df.X.groupby([df.index.hour, df.index.minute]).apply(lambda g: g.rolling(window=5).sum())
(Note that the question asks for a 5-period window, while the previous answer used 10.) For example:
In [43]: df.X.groupby([df.index.hour, df.index.minute]).apply(lambda g: g.rolling(window=5).sum()).tail()
Out[43]:
2000-02-04 17:15:00 -2.135887
2000-02-04 17:16:00 -3.056707
2000-02-04 17:17:00 0.813798
2000-02-04 17:18:00 -1.092548
2000-02-04 17:19:00 -0.997104
Freq: T, Name: X, dtype: float64
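Tying the two answers together: the shift(1) in the first answer is what excludes the current observation, and transform preserves the original index, so a combined sketch for the question's 5-day window might look like:
df['rolling_sum_same_hour'] = (
    df.groupby(df.index.time)['X']
      .transform(lambda g: g.shift(1).rolling(window=5).sum())
)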