Python (pandas) fast mapping of multiple datetimes to their series indices?

I have a large Pandas dataframe in which one column is (unordered) datetimes from a known period (the year 2013). I need an efficient way to convert these datetimes to indices, where each index = # hours since start_time ('2013-1-1 00:00'). There are duplicate times, which should map to duplicate indices.
Obviously, this can be done one-at-a-time with a loop by using timedelta. It can also be done with a loop by using Pandas Series (see the following snippet, which generates the ordered series of all datetimes since start_time):
import pandas as pd

nhours = 365*24
time_series = pd.Series(range(nhours), index=pd.date_range('2013-1-1', periods=nhours, freq='H'))
After running this snippet, one can get indices using the .index or .get_loc methods in a loop.
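For reference, that loop-based baseline looks something like this (a sketch; dts is a hypothetical stand-in for the column of datetimes to look up):

dts = [pd.Timestamp('2013-01-01 11:00'), pd.Timestamp('2013-12-30 18:00')]
indices = [time_series.index.get_loc(t) for t in dts]  # one lookup per element
# [11, 8730]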
However, is there a fast (non-loopy) way to take a column of arbitrary datetimes and find their respective indices?
For example, inputing the following column of datetimes:
2013-01-01 11:00:00
2013-01-01 11:00:00
2013-01-01 00:00:00
2013-12-30 18:00:00
should output the following indices: [11, 11, 0, 8730]

loc can take a list or array of labels to look up:
>>> print(time_series.loc[[pd.Timestamp('20130101 11:00'), pd.Timestamp('20130101 11:00'), pd.Timestamp('20130101'), pd.Timestamp('20131230 18:00')]])
2013-01-01 11:00:00 11
2013-01-01 11:00:00 11
2013-01-01 00:00:00 0
2013-12-30 18:00:00 8730
dtype: int64
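As an aside beyond the original answer, Index.get_indexer is another vectorized option: it maps an array of labels to integer positions in a single call (it requires a unique index, which holds here):

lookups = pd.DatetimeIndex(['2013-01-01 11:00', '2013-01-01 11:00',
                            '2013-01-01', '2013-12-30 18:00'])
time_series.index.get_indexer(lookups)
# array([  11,   11,    0, 8730])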

Thank you for the responses. I have a new, faster solution that takes advantage of the fact that pandas supports datetime and timedelta formats. It turns out that the following is roughly twice as fast as Colin's solution above (although not as flexible), and it avoids the overhead of building a Series of ordered datetimes:
import numpy as np
from datetime import datetime

all_indices = (df['mydatetimes'] - datetime(2013, 1, 1, 0)) / np.timedelta64(1, 'h')
where df is the pandas dataframe and 'mydatetimes' is the name of the column holding the datetimes. Note that the division yields floats; cast with .astype(int) if whole-hour integer indices are needed.
Timing the code shows that this solution maps 30,000 datetimes in:
0:00:00.009909 --> this snippet
0:00:00.017800 --> Colin's solution with ts = Series(...) and ts.loc (excluding the one-time overhead of building the Series)
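A rough sketch of how such a comparison might be run (exact numbers will vary by machine and pandas version; the sample data here is synthetic, and time_series is the Series built in the question):

import numpy as np
import pandas as pd
from datetime import datetime

# 30,000 random hourly timestamps drawn from 2013
rng = np.random.default_rng(0)
sample = pd.Series(pd.Timestamp('2013-01-01') +
                   pd.to_timedelta(rng.integers(0, 365 * 24, 30_000), unit='h'))

t0 = datetime.now()
idx_arith = (sample - datetime(2013, 1, 1, 0)) / np.timedelta64(1, 'h')
print(datetime.now() - t0)  # arithmetic approach

t0 = datetime.now()
idx_loc = time_series.loc[sample].values
print(datetime.now() - t0)  # .loc lookup approach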

Use isin:
time_series[time_series.index.isin(['2013-01-01 11:00:00',
                                     '2013-01-01 00:00:00',
                                     '2013-12-30 18:00:00'])].values
# Returns: array([ 0, 11, 8730])
Note that this returns the matches in index order and cannot produce duplicate indices for duplicate inputs.
between and between_time are also useful.
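A quick illustration of those two (a sketch, reusing the time_series from the question; note that Series.between compares the values, while between_time filters on the time of day in the index):

time_series.between_time('10:00', '12:00')   # rows whose index time falls in [10:00, 12:00]
time_series[time_series.between(0, 11)]      # rows whose value (hour count) is in [0, 11]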

Related

How to select a subset of pandas DateTimeIndex whose data are in a list?

Let's say I have idx = pd.DatetimeIndex with one-minute frequency. I also have a list of bad dates (each of type pd.Timestamp, without the time information) that I want to remove from the original idx. How do I do that in pandas?
Use normalize to remove the time part from your index, then do a simple ~ + isin selection, i.e. find the dates not in the bad list. To be extra safe, you can ensure your list of bad dates has no time part in the same way: [x.normalize() for x in bad_dates].
Sample Data
import pandas as pd
df = pd.DataFrame(range(9), index=pd.date_range('2010-01-01', freq='11H', periods=9))
bad_dates = [pd.Timestamp('2010-01-02'), pd.Timestamp('2010-01-03')]
Code
df[~df.index.normalize().isin(bad_dates)]
# 0
#2010-01-01 00:00:00 0
#2010-01-01 11:00:00 1
#2010-01-01 22:00:00 2
#2010-01-04 05:00:00 7
#2010-01-04 16:00:00 8
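Since the question asked about a bare DatetimeIndex rather than a DataFrame, the same idea applies directly (a sketch, reusing bad_dates from above):

idx = pd.date_range('2010-01-01', freq='T', periods=5000)  # one-minute index
idx[~idx.normalize().isin(bad_dates)]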

Why does date_range give a result different from indexing [] for DataFrame Pandas dates?

Here is a simple code with date_range and indexing [ ] I used with Pandas
period_start = '2013-01-01'
period_end = '2019-12-24'
print(pd.DataFrame({'close': aapl_close,
                    'returns': aapl_returns}, index=pd.date_range(start=period_start, periods=6)))
print(pd.DataFrame({'close': aapl_close,
                    'returns': aapl_returns})[period_start:'20130110'])
date_range gives NaN results
close returns
2013-01-01 NaN NaN
2013-01-02 NaN NaN
2013-01-03 NaN NaN
2013-01-04 NaN NaN
Indexing gives correct results
close returns
2013-01-02 00:00:00+00:00 68.732 0.028322
2013-01-03 00:00:00+00:00 68.032 -0.010184
2013-01-04 00:00:00+00:00 66.091 -0.028531
Based on how the dates are shown by date_range, I suppose the date format of date_range does not match the date format in the Pandas DataFrame.
1) Can you please explain why it gives NaN?
2) What would you suggest to get a specific time range from the Pandas DataFrame?
As I'm a beginner in Python and its libraries, I didn't realize that this question relates to the Quantopian library, not just to Pandas.
I got a solution on their forum. All the times returned by methods on Quantopian are timezone-aware, with a timezone of 'UTC'. By default, the date_range method returns timezone-naive dates, so simply supply the timezone information to the date_range method, like this:
pd.DataFrame({'close': aapl_close,
              'returns': aapl_returns},
             index=pd.date_range(start=period_start, periods=6, tz='UTC'))
To get a specific date or time range in pandas, perhaps the easiest way is simple bracket notation. For example, to get dates between 2013-01-04 and 2013-01-08 (inclusive), simply enter this:
df = pd.DataFrame({'close': aapl_close, 'returns': aapl_returns})
my_selected_dates = df['2013-01-04':'2013-01-08']
This bracket notation is really shorthand for using the loc method:
my_selected_dates = df.loc['2013-01-04':'2013-01-08']
Both work the same, but the loc method has a bit more flexibility. This notation also works with full datetimes if desired; for example:
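A sketch (assuming the index is UTC-aware, as the Quantopian data above is), using Timestamp objects in the slice instead of date strings:

start = pd.Timestamp('2013-01-04', tz='UTC')
end = pd.Timestamp('2013-01-08', tz='UTC')
my_selected_dates = df.loc[start:end]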

For half-hourly intervals can I use Pandas TimeDeltaIndex, PeriodIndex or DateTimeIndex?

I have a table of data values that should be indexed with half-hourly intervals, and I've been processing them with Pandas and Numpy. Currently they're in CSV files and I import them using read_csv to a dataframe with only the interval-endpoint as an index. I am uncomfortable with that and want to have the intervals themselves as the index.
I do not know whether to use a DatetimeIndex, a PeriodIndex or a TimedeltaIndex... All of them seem very similar in practice to me. My operations include
Looking up a particular interval
Checking if a DateTime is contained in a particular interval
Intersection and (Set)Difference of intervals
Split and join intervals
Can Pandas even do all of these? Is it advisable? I am already using this interval library; would using Pandas tslib and period be better?
If you only need a series with a time interval of 30 minutes, you can do this:
import pandas as pd
import datetime as dt

today = dt.date.today()
yesterday = today - dt.timedelta(days=1)
time_range = pd.date_range(yesterday, today, freq='30T')
Now you could use it to set an index, such as:
pd.DataFrame(0, index=time_range, columns=['yourcol'])
Out[35]:
yourcol
2016-09-25 00:00:00 0
2016-09-25 00:30:00 0
2016-09-25 01:00:00 0
2016-09-25 01:30:00 0
2016-09-25 02:00:00 0
This would be a DatetimeIndex.
You can read more about time intervals in pandas here: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
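As an aside beyond that answer: later pandas versions (0.20+) added IntervalIndex, which models the intervals themselves and covers the lookup and containment operations from the question; a minimal sketch:

import pandas as pd

# half-hourly intervals as first-class index entries
iv = pd.interval_range(start=pd.Timestamp('2016-09-25'),
                       periods=4, freq='30T', closed='left')
s = pd.Series(range(4), index=iv)

s[pd.Timestamp('2016-09-25 00:45')]  # -> 1, the interval [00:30, 01:00)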

What representation should I use in Pandas for data valid throughout an interval?

I have a series of hourly prices. Each price is valid throughout the whole 1-hour period. What is the best way to represent these prices in Pandas so that I can index them at arbitrary higher frequencies (such as minutes or seconds) and do arithmetic with them?
Data specifics
Sample prices might be:
>>> import pandas as pd
>>> from numpy.random import randn
>>> prices = pd.Series(randn(5), pd.date_range('2013-01-01 12:00', periods=5, freq='H'))
>>> prices
2013-01-01 12:00:00 -1.001692
2013-01-01 13:00:00 -1.408082
2013-01-01 14:00:00 -0.329637
2013-01-01 15:00:00 1.005882
2013-01-01 16:00:00 1.202557
Freq: H
Now, what representation should I use if I want the value at 13:37:42 (I expect it to be the same as at 13:00)?
>>> prices['2013-01-01 13:37:42']
...
KeyError: <Timestamp: 2013-01-01 13:37:42>
Resampling
I know I could resample the prices and fill in the details (ffill, right?), but that doesn't seem like such a nice solution, because I have to assume the frequency I'm going to be indexing it at and it reduces readability with too many unnecessary data points.
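For completeness, that resampling approach would look roughly like this in modern pandas syntax (a sketch; it does materialize one row per minute, which is exactly the readability cost described above):

minute_prices = prices.resample('T').ffill()
minute_prices['2013-01-01 13:37']  # same value as prices['2013-01-01 13:00']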
Time spans
At first glance a PeriodIndex seems to work
>>> price_periods = prices.to_period()
>>> price_periods['2013-01-01 13:37:42']
-1.408082
But a time-spanned series doesn't offer some of the other functionality I expect from a Series. Say that I have another series, amounts, that says how many items I bought at a certain moment. If I wanted to calculate the prices, I would want to multiply the two series:
>>> amounts = pd.Series([1,2,2], pd.DatetimeIndex(['2013-01-01 13:37', '2013-01-01 13:57', '2013-01-01 14:05']))
>>> amounts*price_periods
but that yields an exception and sometimes even freezes my IPy Notebook. Indexing doesn't help either.
>>> price_periods[amounts.index]
Are PeriodIndex structures still a work in progress, or are these features not going to be added? Is there maybe some other structure I should have used (or should use for now, before PeriodIndex matures)? I'm using Pandas version 0.9.0.dev-1e68fd9.
Check asof
prices.asof('2013-01-01 13:37:42')
returns the value for the previous available datetime:
prices['2013-01-01 13:00:00']
To make calculations, you can use:
prices.asof(amounts.index) * amounts
which returns a Series with amounts' index and the respective values:
>>> prices
2013-01-01 12:00:00 0.943607
2013-01-01 13:00:00 -1.019452
2013-01-01 14:00:00 -0.279136
2013-01-01 15:00:00 1.013548
2013-01-01 16:00:00 0.929920
>>> prices.asof(amounts.index)
2013-01-01 13:37:00 -1.019452
2013-01-01 13:57:00 -1.019452
2013-01-01 14:05:00 -0.279136
>>> prices.asof(amounts.index) * amounts
2013-01-01 13:37:00 -1.019452
2013-01-01 13:57:00 -2.038904
2013-01-01 14:05:00 -0.558272
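As a side note beyond this answer: newer pandas versions (0.19+) also provide pd.merge_asof for this kind of as-of join between two tables; a sketch using the same data:

trades = amounts.rename('amount').reset_index()
quotes = prices.rename('price').reset_index()
merged = pd.merge_asof(trades, quotes, on='index')  # both sorted on 'index'
merged['cost'] = merged['amount'] * merged['price']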

Forcing dates to conform to a given frequency in pandas

Suppose we have a monthly time series, possibly with missing months, and upon loading the data into a pandas Series object with DatetimeIndex we wish to make sure each date observation is labeled as an end-of-month date. However, the raw input dates may fall anywhere in the month, so we need to force them to end-of-month observations.
My first thought was to do something like this:
import pandas as pd
from datetime import datetime

pd.DatetimeIndex([datetime(2012,1,20), datetime(2012,7,31)], freq='M')
However, this just leaves the dates as is [2012-01-20,2012-07-31] and does not force them to end-of-month values [2012-01-31,2012-07-31].
My second attempt was:
import numpy as np

ix = pd.DatetimeIndex([datetime(2012,1,20), datetime(2012,7,31)], freq='M')
s = pd.Series(np.random.randn(len(ix)), index=ix)
s.asfreq('M')
But this gives:
2012-01-31 NaN
2012-02-29 NaN
2012-03-31 NaN
2012-04-30 NaN
2012-05-31 NaN
2012-06-30 NaN
2012-07-31 0.79173
Freq: M
as under the hood the asfreq function is calling date_range for a DatetimeIndex.
This problem is easily solved if I'm using PeriodIndex instead of DatetimeIndex; however, I need to support some frequencies that are not currently supported by PeriodIndex and as far as I know there is no way to extend pandas with my own Period frequencies.
It's a workaround, but it works without using PeriodIndex:
from pandas.tseries.offsets import MonthEnd
In [164]: s
Out[164]:
2012-01-20 -1.266376
2012-07-31 -0.865573
In [165]: s.index = s.index + MonthEnd(n=0)
In [166]: s
Out[166]:
2012-01-31 -1.266376
2012-07-31 -0.865573
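The same workaround as a self-contained snippet (MonthEnd(n=0) rolls a date forward to its month end and leaves dates already at month end unchanged):

import numpy as np
import pandas as pd
from datetime import datetime
from pandas.tseries.offsets import MonthEnd

ix = pd.DatetimeIndex([datetime(2012, 1, 20), datetime(2012, 7, 31)])
s = pd.Series(np.random.randn(len(ix)), index=ix)
s.index = s.index + MonthEnd(n=0)  # -> 2012-01-31, 2012-07-31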
