Forcing dates to conform to a given frequency in pandas - python

Suppose we have a monthly time series, possibly with missing months, and upon loading the data into a pandas Series object with DatetimeIndex we wish to make sure each date observation is labeled as an end-of-month date. However, the raw input dates may fall anywhere in the month, so we need to force them to end-of-month observations.
My first thought was to do something like this:
from datetime import datetime
import pandas as pd
pd.DatetimeIndex([datetime(2012, 1, 20), datetime(2012, 7, 31)], freq='M')
However, this just leaves the dates as-is ([2012-01-20, 2012-07-31]) and does not force them to end-of-month values ([2012-01-31, 2012-07-31]).
My second attempt was:
import numpy as np
ix = pd.DatetimeIndex([datetime(2012, 1, 20), datetime(2012, 7, 31)], freq='M')
s = pd.Series(np.random.randn(len(ix)), index=ix)
s.asfreq('M')
But this gives:
2012-01-31 NaN
2012-02-29 NaN
2012-03-31 NaN
2012-04-30 NaN
2012-05-31 NaN
2012-06-30 NaN
2012-07-31 0.79173
Freq: M
because under the hood asfreq calls date_range for a DatetimeIndex.
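Roughly, my reading of those internals is the following sketch (an illustration, not the actual library code):
# reindex onto a fresh month-end date_range spanning the series
s.reindex(pd.date_range(s.index[0], s.index[-1], freq='M'))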
This problem is easily solved if I use a PeriodIndex instead of a DatetimeIndex; however, I need to support some frequencies that are not currently supported by PeriodIndex, and as far as I know there is no way to extend pandas with my own Period frequencies.

It's a workaround, but it works without using PeriodIndex:
from pandas.tseries.offsets import MonthEnd
In [164]: s
Out[164]:
2012-01-20 -1.266376
2012-07-31 -0.865573
In [165]: s.index = s.index + MonthEnd(n=0)
In [166]: s
Out[166]:
2012-01-31 -1.266376
2012-07-31 -0.865573
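For what it's worth, MonthEnd(n=0) is an anchored offset: adding it rolls a date forward to its month end but leaves dates that already fall on a month end untouched, which is exactly the snapping behavior wanted here. A minimal self-contained version of the workaround:
import pandas as pd
from datetime import datetime
from pandas.tseries.offsets import MonthEnd
ix = pd.DatetimeIndex([datetime(2012, 1, 20), datetime(2012, 7, 31)])
# 2012-01-20 rolls forward to 2012-01-31; 2012-07-31 stays put
ix + MonthEnd(n=0)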

Related

Why does date_range give a result different from indexing [] for DataFrame Pandas dates?

Here is some simple code using date_range and indexing [] with Pandas:
period_start = '2013-01-01'
period_end = '2019-12-24'
print(pd.DataFrame({'close': aapl_close,
                    'returns': aapl_returns},
                   index=pd.date_range(start=period_start, periods=6)))
print(pd.DataFrame({'close': aapl_close,
                    'returns': aapl_returns})[period_start:'20130110'])
date_range gives NaN results:
close returns
2013-01-01 NaN NaN
2013-01-02 NaN NaN
2013-01-03 NaN NaN
2013-01-04 NaN NaN
Indexing gives correct results:
close returns
2013-01-02 00:00:00+00:00 68.732 0.028322
2013-01-03 00:00:00+00:00 68.032 -0.010184
2013-01-04 00:00:00+00:00 66.091 -0.028531
Based on how the dates are shown by date_range, I suppose the date format of date_range does not match the date format in the Pandas DataFrame.
1) Can you please explain why it gives NaN?
2) What would you suggest to get a specific time range from the Pandas DataFrame?
As I'm a beginner in Python and its libraries, I didn't understand that this question refers to the Quantopian library, not to Pandas itself.
I got a solution on their forum. All the times returned by methods on Quantopian are timezone-aware with a timezone of 'UTC'. By default, the date_range method returns timezone-naive dates. Simply supply the timezone information to the date_range call, like this:
pd.DataFrame({'close': aapl_close,
              'returns': aapl_returns},
             index=pd.date_range(start=period_start, periods=6, tz='UTC'))
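An equivalent alternative (a sketch, using the same assumed aapl_close/aapl_returns data) is to localize a naive index after the fact:
# make a naive DatetimeIndex timezone-aware in UTC
idx = pd.date_range(start=period_start, periods=6).tz_localize('UTC')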
To get a specific date or time range in pandas perhaps the easiest is simple bracket notation. For example, to get dates between 2013-01-04 and 2013-01-08 (inclusive) simply enter this:
df = pd.DataFrame({'close': aapl_close, 'returns': aapl_returns})
my_selected_dates = df['2013-01-04':'2013-01-08']
This bracket notation is really shorthand for using the loc method:
my_selected_dates = df.loc['2013-01-04':'2013-01-08']
Both work the same, but the loc method has a bit more flexibility. This notation also works with datetimes if desired.
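For instance (a sketch, assuming a DataFrame whose DatetimeIndex carries intraday timestamps), you can slice with full datetime strings the same way:
# select rows between two exact timestamps, inclusive
my_selected_times = df.loc['2013-01-04 09:30':'2013-01-08 16:00']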

Is datetime data in Pandas supposed to be in the index?

By supposed to, what I mean is:
Is that the way Pandas is designed? Are all Pandas time series functions built upon that assumption?
A few weeks ago I was experimenting with pandas.rolling_mean which seemed to want the datetime to be in the index.
Given a dataframe like this:
df = pd.DataFrame({'date': ['23/10/2017', '24/10/2017', '25/10/2017', '26/10/2017', '27/10/2017'],
                   'dax-close': [13003.14, 13013.19, 12953.41, 13133.28, 13217.54]})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)  # the raw dates are day-first
df
...is it important to always do this:
df.set_index('date', inplace=True)
df
...as one of the first steps of an analysis?
The short answer is: usually time series data has the date as a DatetimeIndex, and many pandas functions do make use of that; resample is a big one.
That said, you don't need to have dates as an index. For example, you may even have multiple datetime columns, in which case you're out of luck calling the vanilla resample... however, you can use pd.Grouper to define the "resample" on a column (or as part of a larger/multi-column groupby):
In [11]: df.groupby(pd.Grouper(key="date", freq="2D")).sum()
Out[11]:
dax-close
date
2017-10-23 26016.33
2017-10-25 26086.69
2017-10-27 13217.54
In [12]: df.set_index("date").resample("2D").sum()
Out[12]:
dax-close
date
2017-10-23 26016.33
2017-10-25 26086.69
2017-10-27 13217.54
The former gives more flexibility in that you can groupby multiple columns:
In [21]: df["X"] = list("AABAC")
In [22]: df.groupby(["X", pd.Grouper(key="date", freq="2D")]).sum()
Out[22]:
dax-close
X date
A 2017-10-23 26016.33
2017-10-25 13133.28
B 2017-10-25 12953.41
C 2017-10-27 13217.54
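One more benefit of a DatetimeIndex worth noting (a sketch added here, using the frame above): partial string indexing lets you select whole months or years directly:
# select every row from October 2017 via partial string indexing
df.set_index("date").loc["2017-10"]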

For half-hourly intervals can I use Pandas TimeDeltaIndex, PeriodIndex or DateTimeIndex?

I have a table of data values that should be indexed with half-hourly intervals, and I've been processing them with Pandas and Numpy. Currently they're in CSV files and I import them using read_csv to a dataframe with only the interval-endpoint as an index. I am uncomfortable with that and want to have the intervals themselves as the index.
I do not know whether to use a DatetimeIndex, a PeriodIndex or a TimedeltaIndex... All of them seem very similar in practice to me. My operations include:
Looking up a particular interval
Checking if a DateTime is contained in a particular interval
Intersection and (Set)Difference of intervals
Split and join intervals
Can Pandas even do all of these? Is it advisable? I am already using this interval library; would using Pandas tslib and period be better?
If you only need a series with a time interval of 30 minutes, you can do this:
import pandas as pd
import datetime as dt
today = dt.date.today()
yesterday = today - dt.timedelta(days=1)
time_range = pd.date_range(yesterday,today, freq='30T')
Now you can use it to build an index, such as:
pd.DataFrame(0, index=time_range, columns=['yourcol'])
Out[35]:
yourcol
2016-09-25 00:00:00 0
2016-09-25 00:30:00 0
2016-09-25 01:00:00 0
2016-09-25 01:30:00 0
2016-09-25 02:00:00 0
This would be a DatetimeIndex.
You can read more about time intervals in pandas here: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
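If you genuinely need interval semantics (containment checks, finding which interval a timestamp falls in), pandas also offers an IntervalIndex. A minimal sketch, assuming a reasonably recent pandas (>= 0.20):
import pandas as pd
# half-open half-hour intervals covering one day
intervals = pd.interval_range(start=pd.Timestamp('2016-09-25'),
                              end=pd.Timestamp('2016-09-26'),
                              freq='30T', closed='left')
df = pd.DataFrame(0, index=intervals, columns=['yourcol'])
# position of the interval containing a given timestamp
pos = intervals.get_loc(pd.Timestamp('2016-09-25 01:15'))
intervals[pos]  # Interval('2016-09-25 01:00', '2016-09-25 01:30', closed='left')
Set operations (intersection, difference) and splitting/joining of intervals are not built in, so a dedicated interval library may still be the better fit for those.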

Python (pandas) fast mapping of multiple datetimes to their series indices?

I have a large Pandas dataframe in which one column is (unordered) datetimes from a known period (the year 2013). I need an efficient way to convert these datetimes to indices, where each index = # hours since start_time ('2013-01-01 00:00'). There are duplicate times, which should map to duplicate indices.
Obviously, this can be done one-at-a-time with a loop by using timedelta. It can also be done with a loop by using Pandas Series (see the following snippet, which generates the ordered series of all datetimes since start_time):
nhours = 365*24
time_series = pd.Series(range(nhours), index=pd.date_range('2013-1-1', periods=nhours, freq='H'))
After running this snippet, one can get indices using the .index or .get_loc methods in a loop.
However, is there a fast (non-loopy) way to take a column of arbitrary datetimes and find their respective indices?
For example, inputing the following column of datetimes:
2013-01-01 11:00:00
2013-01-01 11:00:00
2013-01-01 00:00:00
2013-12-30 18:00:00
should output the following indices: [11, 11, 0, 8730]
loc can take a list or array of labels to look up:
>>> time_series.loc[[pd.Timestamp('20130101 11:00'),
...                  pd.Timestamp('20130101 11:00'),
...                  pd.Timestamp('20130101'),
...                  pd.Timestamp('20131230 18:00')]]
2013-01-01 11:00:00 11
2013-01-01 11:00:00 11
2013-01-01 00:00:00 0
2013-12-30 18:00:00 8730
dtype: int64
Thank you for the responses. I have a new, faster solution that takes advantage of the fact that pandas supports datetime and timedelta types. It turns out that the following is roughly twice as fast as Colin's solution above (although not as flexible), and it avoids the overhead of building a Series of ordered datetimes:
# requires: from datetime import datetime; import numpy as np
all_indices = (df['mydatetimes'] - datetime(2013, 1, 1, 0)) / np.timedelta64(1, 'h')
where df is the pandas dataframe and 'mydatetimes' is the column name that includes the datetimes.
Timing the code shows that this solution handles 30,000 indices in:
0:00:00.009909 --> this snippet
0:00:00.017800 --> Colin's solution with ts=Series(...) and ts.loc. I have excluded the one-time overhead of building a Series from this timing
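Another vectorized option (a sketch added here, not from the original answers) is DatetimeIndex.get_indexer, which maps an array of timestamps to their integer positions in one call and preserves duplicates and input order:
hours = pd.date_range('2013-01-01', periods=365 * 24, freq='H')
# positions of each lookup timestamp within the hourly index
hours.get_indexer(pd.to_datetime(['2013-01-01 11:00',
                                  '2013-01-01 11:00',
                                  '2013-01-01 00:00',
                                  '2013-12-30 18:00']))
# array([  11,   11,    0, 8730])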
Use isin:
time_series[time_series.index.isin(['2013-01-01 11:00:00',
                                    '2013-01-01 00:00:00',
                                    '2013-12-30 18:00:00'])].values
# Returns: array([   0,   11, 8730])
Note that isin filters in index order and collapses duplicate lookups, so it returns [0, 11, 8730] rather than the requested [11, 11, 0, 8730].
between and between_time are also useful.

Handling monthly-binned data in pandas

I have a dataset I'm analyzing in pandas where all data is binned monthly. The data originates from a MySQL database where all dates are in the format 'YYYY-MM-01', such that, for example, all rows for October 2013 would have "2013-10-01" in the month column.
I'm currently reading the data into pandas (via a .tsv dump of the MySQL table) with
data = pd.read_table(filename, header=None,
                     names=('uid', 'iid', 'artist', 'tag', 'date'),
                     index_col=indexes, parse_dates=['date'])
This is all fine, except that any subsequent analysis in which I do monthly resampling represents dates using the end-of-month convention (i.e. data from October becomes '2013-10-31' instead of '2013-10-01'). This can lead to inconsistencies where the original data has months labeled 'YYYY-MM-01' while any resampled data has months labeled 'YYYY-MM-31' (or '-30' or '-28', as appropriate).
My question is this: What is the easiest and/or fastest way I can convert all the dates in my dataframe to the end-of-month format from the outset? Keep in mind that the date is one of several indexes in a multi-index, not a column. I think my best bet is to use a modified date_parser in my pd.read_table call that always converts the month to the end-of-month convention, but I'm not sure how to approach it.
Read your dates in exactly like you are doing.
Create some test data. I am setting the dates to the start of month, but it doesn't matter.
In [39]: df = DataFrame(np.random.randn(10, 2), columns=list('AB'),
   ....:                index=date_range('20130101', periods=10, freq='MS'))
In [40]: df
Out[40]:
A B
2013-01-01 -0.553482 0.049128
2013-02-01 0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638 1.903388
2013-05-01 -0.087752 1.551916
2013-06-01 1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643 1.661696
2013-09-01 0.501061 -1.455171
2013-10-01 1.343630 -2.008060
Force-convert them to end-of-month timestamps, regardless of the day:
In [41]: df.index = df.index.to_period().to_timestamp('M')
In [42]: df
Out[42]:
A B
2013-01-31 -0.553482 0.049128
2013-02-28 0.337975 -0.035897
2013-03-31 -0.394849 -1.755323
2013-04-30 -0.555638 1.903388
2013-05-31 -0.087752 1.551916
2013-06-30 1.000943 -0.361248
2013-07-31 -1.855171 -2.215276
2013-08-31 -0.582643 1.661696
2013-09-30 0.501061 -1.455171
2013-10-31 1.343630 -2.008060
Back to the start of the month:
In [43]: df.index = df.index.to_period().to_timestamp('MS')
In [44]: df
Out[44]:
A B
2013-01-01 -0.553482 0.049128
2013-02-01 0.337975 -0.035897
2013-03-01 -0.394849 -1.755323
2013-04-01 -0.555638 1.903388
2013-05-01 -0.087752 1.551916
2013-06-01 1.000943 -0.361248
2013-07-01 -1.855171 -2.215276
2013-08-01 -0.582643 1.661696
2013-09-01 0.501061 -1.455171
2013-10-01 1.343630 -2.008060
You can also work with (and resample) the data as periods:
In [45]: df.index = df.index.to_period()
In [46]: df
Out[46]:
A B
2013-01 -0.553482 0.049128
2013-02 0.337975 -0.035897
2013-03 -0.394849 -1.755323
2013-04 -0.555638 1.903388
2013-05 -0.087752 1.551916
2013-06 1.000943 -0.361248
2013-07 -1.855171 -2.215276
2013-08 -0.582643 1.661696
2013-09 0.501061 -1.455171
2013-10 1.343630 -2.008060
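For example (a sketch, assuming the PeriodIndex frame above), downsampling to quarterly periods then works directly:
# downsample the monthly periods to quarterly means
df.resample('Q').mean()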
Use replace() to change the day value; you can get the last day of the month using calendar.monthrange:
from datetime import date
import calendar
d = date(2000, 1, 1)
# monthrange returns (weekday of first day, number of days in month)
d = d.replace(day=calendar.monthrange(d.year, d.month)[1])
UPDATE
I've added an example for pandas.
Sample file date.csv:
2013-01-01, 1
2013-02-01, 2
IPython shell log:
In [27]: import pandas as pd
In [28]: from datetime import datetime, date
In [29]: import calendar
In [30]: def parse(dt):
   ....:     dt = datetime.strptime(dt, '%Y-%m-%d')
   ....:     dt = dt.replace(day=calendar.monthrange(dt.year, dt.month)[1])
   ....:     return dt.date()
   ....:
In [31]: parse('2013-01-01')
Out[31]: datetime.date(2013, 1, 31)
In [32]: r = pd.read_csv('date.csv', header=None, names=('date', 'value'),
   ....:                 parse_dates=['date'], date_parser=parse)
In [33]: r
Out[33]:
date value
0 2013-01-31 1
1 2013-02-28 2
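As an aside (not part of the original answer): in recent pandas versions the date_parser argument is deprecated, so a sketch of an equivalent approach is to parse normally and snap the dates afterwards:
import pandas as pd
r = pd.read_csv('date.csv', header=None, names=('date', 'value'),
                parse_dates=['date'])
# snap each date forward to its month end; month-end dates are unchanged
r['date'] = r['date'] + pd.offsets.MonthEnd(0)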
