Python pandas time series with hierarchical indexing and rolling/shifting

I'm struggling with pandas' rolling and shifting concepts. There are many good suggestions, including on this forum, but I've failed miserably to apply them to my scenario.
Right now I loop over the time series the traditional way, but it took about 8 hours to iterate over 150,000 rows, which is roughly 3 days of data across all tickers. With 2 months of data to process, it probably won't finish before I'm back from a sabbatical, not to mention the risk of a power cut, after which I'd have to start all over again, this time with no sabbatical while waiting.
I have the following 15-minute stock price time series (hierarchical index on datetime (timestamp) and ticker; the only original column is closePrice):
closePrice
datetime ticker
2014-02-04 09:15:00 AAPL xxx
EQIX xxx
FB xxx
GOOG xxx
MSFT xxx
2014-02-04 09:30:00 AAPL xxx
EQIX xxx
FB xxx
GOOG xxx
MSFT xxx
2014-02-04 09:45:00 AAPL xxx
EQIX xxx
FB xxx
GOOG xxx
MSFT xxx
I need to add two columns:
12sma, the 12-period simple moving average. Having searched SO for hours, the best suggestion seemed to be rolling_mean, so I tried it. But it didn't work with my TS structure, i.e. it works top-down: the first MA is calculated from the first 12 rows regardless of the ticker. How do I make it average based on the index, i.e. first datetime then ticker, so I get the MA for, say, AAPL? Currently it does (AAPL+EQIX+FB+GOOG+MSFT+AAPL... up to the 12th row) / 12.
Once I have the 12sma column, I need a 12ema column, the 12-period exponential MA. For this calculation, the first value in the time series for each ticker simply copies the 12sma value from the same row. After that, each value needs the closePrice from the same row and the 12ema from the previous row, i.e. 15 minutes earlier. After a lot of research it seems the solution would be a combination of rolling and shifting, but I can't figure out how to put it together.
Any help would be much appreciated.
Thanks.
EDIT:
Thanks to Jeff's tips, after swapping and sorting the index levels I'm able to get 12sma right with rolling_mean(), and with some effort I managed to insert the first 12ema value, copied from 12sma at the same timestamp:
close 12sma 12ema
sec_code datetime
AAPL 2014-02-05 11:45:00 113.0 NaN NaN
2014-02-05 12:00:00 113.2 NaN NaN
2014-02-05 13:15:00 112.9 NaN NaN
2014-02-05 13:30:00 113.2 NaN NaN
2014-02-05 13:45:00 113.0 NaN NaN
2014-02-05 14:00:00 113.1 NaN NaN
2014-02-05 14:15:00 113.3 NaN NaN
2014-02-05 14:30:00 113.3 NaN NaN
2014-02-05 14:45:00 113.3 NaN NaN
2014-02-05 15:00:00 113.2 NaN NaN
2014-02-05 15:15:00 113.2 NaN NaN
2014-02-05 15:30:00 113.3 113.16 113.16
2014-02-05 15:45:00 113.3 113.19 NaN
2014-02-05 16:00:00 113.2 113.19 NaN
2014-02-06 09:45:00 112.6 113.16 NaN
2014-02-06 10:00:00 113.5 113.19 NaN
2014-02-06 10:15:00 113.8 113.25 NaN
2014-02-06 10:30:00 113.5 113.29 NaN
2014-02-06 10:45:00 113.7 113.32 NaN
2014-02-06 11:00:00 113.5 113.34 NaN
I understand pandas has pandas.stats.moments.ewma, but I prefer to use a formula I got from a book, which needs the close price 'at the moment' and the 12ema from the previous row.
So I tried to fill the 12ema column from Feb 5, 15:45 onward using apply() with a function, but shift gave an error:
def f12ema(x):
    K = 2 / (12 + 1)
    return x['price_nom'] * K + x['12ema'].shift(-1) * (1-K)

df1.apply(f12ema, axis=1)
AttributeError: ("'numpy.float64' object has no attribute 'shift'", u'occurred at index 2014-02-05 11:45:00')
Another possibility that crossed my mind is rolling_apply(), but it is beyond my knowledge.

Create a date range inclusive of the times you want
In [47]: rng = date_range('20130102 09:30:00','20130105 16:00:00',freq='15T')
In [48]: rng = rng.take(rng.indexer_between_time('09:30:00','16:00:00'))
In [49]: rng
Out[49]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-02 09:30:00, ..., 2013-01-05 16:00:00]
Length: 108, Freq: None, Timezone: None
Create a frame similar to yours (2000 tickers x dates)
In [50]: df = DataFrame(np.random.randn(len(rng)*2000,1),columns=['close'],index=MultiIndex.from_product([rng,range(2000)],names=['date','ticker']))
Reorder the levels so that the index is ticker x date, and SORT IT!!!!
In [51]: df = df.swaplevel('ticker','date').sortlevel()
In [52]: df
Out[52]:
close
ticker date
0 2013-01-02 09:30:00 0.840767
2013-01-02 09:45:00 1.808101
2013-01-02 10:00:00 -0.718354
2013-01-02 10:15:00 -0.484502
2013-01-02 10:30:00 0.563363
2013-01-02 10:45:00 0.553920
2013-01-02 11:00:00 1.266992
2013-01-02 11:15:00 -0.641117
2013-01-02 11:30:00 -0.574673
2013-01-02 11:45:00 0.861825
2013-01-02 12:00:00 -1.562111
2013-01-02 12:15:00 -0.072966
2013-01-02 12:30:00 0.673079
2013-01-02 12:45:00 0.766105
2013-01-02 13:00:00 0.086202
2013-01-02 13:15:00 0.949205
2013-01-02 13:30:00 -0.381194
2013-01-02 13:45:00 0.316813
2013-01-02 14:00:00 -0.620176
2013-01-02 14:15:00 -0.193126
2013-01-02 14:30:00 -1.552111
2013-01-02 14:45:00 1.724429
2013-01-02 15:00:00 -0.092393
2013-01-02 15:15:00 0.197763
2013-01-02 15:30:00 0.064541
2013-01-02 15:45:00 -1.574853
2013-01-02 16:00:00 -1.023979
2013-01-03 09:30:00 -0.079349
2013-01-03 09:45:00 -0.749285
2013-01-03 10:00:00 0.652721
2013-01-03 10:15:00 -0.818152
2013-01-03 10:30:00 0.674068
2013-01-03 10:45:00 2.302714
2013-01-03 11:00:00 0.614686
...
[216000 rows x 1 columns]
Group by the ticker. Return a DataFrame for each ticker that is the application of rolling_mean and ewma. Note that there are many options for controlling this, e.g. the windowing; you could make it not wrap around days, etc.
In [53]: df.groupby(level='ticker')['close'].apply(lambda x: concat({ 'spma' : pd.rolling_mean(x,3), 'ema' : pd.ewma(x,3) }, axis=1))
Out[53]:
ema spma
ticker date
0 2013-01-02 09:30:00 0.840767 NaN
2013-01-02 09:45:00 1.393529 NaN
2013-01-02 10:00:00 0.480282 0.643504
2013-01-02 10:15:00 0.127447 0.201748
2013-01-02 10:30:00 0.270334 -0.213164
2013-01-02 10:45:00 0.356580 0.210927
2013-01-02 11:00:00 0.619245 0.794758
2013-01-02 11:15:00 0.269100 0.393265
2013-01-02 11:30:00 0.041032 0.017067
2013-01-02 11:45:00 0.258476 -0.117988
2013-01-02 12:00:00 -0.216742 -0.424986
2013-01-02 12:15:00 -0.179622 -0.257750
2013-01-02 12:30:00 0.038741 -0.320666
2013-01-02 12:45:00 0.223881 0.455406
2013-01-02 13:00:00 0.188995 0.508462
2013-01-02 13:15:00 0.380972 0.600504
2013-01-02 13:30:00 0.188987 0.218071
2013-01-02 13:45:00 0.221125 0.294942
2013-01-02 14:00:00 0.009907 -0.228185
2013-01-02 14:15:00 -0.041013 -0.165496
2013-01-02 14:30:00 -0.419688 -0.788471
2013-01-02 14:45:00 0.117299 -0.006936
2013-01-04 10:00:00 -0.060415 0.341013
2013-01-04 10:15:00 0.074068 0.604611
2013-01-04 10:30:00 -0.108502 0.440256
2013-01-04 10:45:00 -0.514229 -0.636702
... ...
[216000 rows x 2 columns]
Pretty good perf, as it's essentially just looping over the tickers.
In [54]: %timeit df.groupby(level='ticker')['close'].apply(lambda x: concat({ 'spma' : pd.rolling_mean(x,3), 'ema' : pd.ewma(x,3) }, axis=1))
1 loops, best of 3: 2.1 s per loop
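As a side note for newer pandas versions, where rolling_mean and ewma have been removed, a rough equivalent of the same idea uses .rolling() and .ewm() inside the groupby. The sketch below also shows one way to get the book-style 12ema seeded with the 12-period SMA; the column names (sma12, ema12, ema12_seeded) and the per-ticker helper are illustrative, not part of the original answer.

import numpy as np
import pandas as pd

# A small frame shaped like the one above: (ticker, date) MultiIndex, one 'close' column.
rng = pd.date_range('2013-01-02 09:30:00', '2013-01-05 16:00:00', freq='15T')
rng = rng.take(rng.indexer_between_time('09:30:00', '16:00:00'))
idx = pd.MultiIndex.from_product([['AAPL', 'MSFT'], rng], names=['ticker', 'date'])
df = pd.DataFrame({'close': np.random.randn(len(idx))}, index=idx).sort_index()

g = df.groupby(level='ticker')['close']

# Modern replacements for rolling_mean / ewma, applied per ticker.
df['sma12'] = g.transform(lambda x: x.rolling(12).mean())
df['ema12'] = g.transform(lambda x: x.ewm(span=12, adjust=False).mean())

def ema_seeded_with_sma(x, n=12):
    # Recursive EMA: the first value copies the n-period SMA,
    # then ema[t] = K * close[t] + (1 - K) * ema[t-1].
    k = 2.0 / (n + 1)
    sma = x.rolling(n).mean()
    out = np.full(len(x), np.nan)
    prev = np.nan
    for i in range(len(x)):
        if np.isnan(prev):
            prev = sma.iloc[i]           # seed with the SMA once it exists
        else:
            prev = x.iloc[i] * k + prev * (1 - k)
        out[i] = prev
    return pd.Series(out, index=x.index)

df['ema12_seeded'] = g.transform(ema_seeded_with_sma)

The built-in ewm(span=12, adjust=False) seeds the recursion with the first close rather than with the SMA, which is why the separate helper is needed for the book's variant.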

Related

Expand a time series in the form of numpy.array(), pandas.DataFrame(), or xarray.Dataset() to contain the missing records as NaN

import numpy as np
import pandas as pd
import xarray as xr
validIdx = np.ones(365*5, dtype= bool)
validIdx[np.random.randint(low=0, high=365*5, size=30)] = False
time = pd.date_range("2000-01-01", freq="H", periods=365 * 5)[validIdx]
data = np.arange(365 * 5)[validIdx]
ds = xr.Dataset({"foo": ("time", data), "time": time})
df = ds.to_dataframe()
In the above example, the time-series data ds (or df) has 30 randomly chosen missing records, which are not present as NaNs. Therefore, the length of the data is 365x5 - 30, not 365x5.
My question is this: how can I expand ds and df so that the 30 missing values appear as NaNs (so the length will be 365x5)? For example, if the value at "2000-12-02" is missing from the example data, it currently looks like:
...
2000-12-01 value 1
2000-12-03 value 2
...
, while what I want to have is:
...
2000-12-01 value 1
2000-12-02 NaN
2000-12-03 value 2
...
Perhaps you can try resample with 1 hour.
The df without NaNs (just after df = ds.to_dataframe()):
>>> df
foo
time
2000-01-01 00:00:00 0
2000-01-01 01:00:00 1
2000-01-01 02:00:00 2
2000-01-01 03:00:00 3
2000-01-01 04:00:00 4
... ...
2000-03-16 20:00:00 1820
2000-03-16 21:00:00 1821
2000-03-16 22:00:00 1822
2000-03-16 23:00:00 1823
2000-03-17 00:00:00 1824
[1795 rows x 1 columns]
The df with NaNs (df_1h):
>>> df_1h = df.resample('1H').mean()
>>> df_1h
foo
time
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 4.0
... ...
2000-03-16 20:00:00 1820.0
2000-03-16 21:00:00 1821.0
2000-03-16 22:00:00 1822.0
2000-03-16 23:00:00 1823.0
2000-03-17 00:00:00 1824.0
[1825 rows x 1 columns]
Rows with NaN:
>>> df_1h[df_1h['foo'].isna()]
foo
time
2000-01-02 10:00:00 NaN
2000-01-04 07:00:00 NaN
2000-01-05 06:00:00 NaN
2000-01-09 02:00:00 NaN
2000-01-13 15:00:00 NaN
2000-01-16 16:00:00 NaN
2000-01-18 21:00:00 NaN
2000-01-21 22:00:00 NaN
2000-01-23 19:00:00 NaN
2000-01-24 01:00:00 NaN
2000-01-24 19:00:00 NaN
2000-01-27 12:00:00 NaN
2000-01-27 16:00:00 NaN
2000-01-29 06:00:00 NaN
2000-02-02 01:00:00 NaN
2000-02-06 13:00:00 NaN
2000-02-09 11:00:00 NaN
2000-02-15 12:00:00 NaN
2000-02-15 15:00:00 NaN
2000-02-21 04:00:00 NaN
2000-02-28 05:00:00 NaN
2000-02-28 06:00:00 NaN
2000-03-01 15:00:00 NaN
2000-03-02 18:00:00 NaN
2000-03-04 18:00:00 NaN
2000-03-05 20:00:00 NaN
2000-03-12 08:00:00 NaN
2000-03-13 20:00:00 NaN
2000-03-16 01:00:00 NaN
The number of NaNs in df_1h:
>>> df_1h.isnull().sum()
foo 30
dtype: int64
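An alternative sketch, assuming the underlying grid is hourly as in the example: reindex against a complete hourly date_range instead of resampling, which avoids the aggregation step and maps straight back to xarray (df here is the frame produced by ds.to_dataframe() above).

import pandas as pd

full_index = pd.date_range(df.index.min(), df.index.max(), freq="H", name="time")
df_full = df.reindex(full_index)   # timestamps absent from df become NaN rows
ds_full = df_full.to_xarray()      # the same expanded series on the xarray side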

Reverse position of entries in pandas dataframe based on condition

Here I have an extract from my pandas dataframe, which is survey data with two datetime fields. It appears that some of the start and end times were filled in the wrong positions in the survey. Here is an example from my dataframe; I suspect the start and end time in the 8th row were entered the wrong way round.
Just to give context, I generated the third column like this:
df_time['trip_duration'] = df_time['tripEnd_time'] - df_time['tripStart_time']
The three columns are in timedelta64 format.
Here is the top of my dataframe:
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 -1 days +22:15:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
What I am trying to do is loop through these two columns and, each time 'tripEnd_time' is less than 'tripStart_time', swap the two entries. So in the case of row 8 above, I would make tripStart_time = tripEnd_time and tripEnd_time = tripStart_time.
I am not quite sure of the best way to approach this. Should I use a nested for loop where I compare each entry in the two columns?
Thanks
Use Series.abs:
df_time['trip_duration'] = (df_time['tripEnd_time'] - df_time['tripStart_time']).abs()
print (df_time)
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 01:45:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
Which is the same as:
a = df_time['tripEnd_time'] - df_time['tripStart_time']
b = df_time['tripStart_time'] - df_time['tripEnd_time']
mask = df_time['tripEnd_time'] > df_time['tripStart_time']
df_time['trip_duration'] = np.where(mask, a, b)
print (df_time)
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 01:45:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
You can switch column values on selected rows:
df_time.loc[df_time['tripEnd_time'] < df_time['tripStart_time'],
['tripStart_time', 'tripEnd_time']] = df_time.loc[
df_time['tripEnd_time'] < df_time['tripStart_time'],
['tripEnd_time', 'tripStart_time']].values
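Whichever variant you use, a quick follow-up check (a sketch, assuming the swap above has already been applied to df_time) is to recompute the duration and confirm nothing is negative any more:

import pandas as pd

df_time['trip_duration'] = df_time['tripEnd_time'] - df_time['tripStart_time']
assert (df_time['trip_duration'] >= pd.Timedelta(0)).all()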

Fill datetimeindex gap by NaN

I have two dataframes which are datetimeindexed. One is missing a few of these datetimes (df1) while the other is complete (has regular timestamps without any gaps in this series) and is full of NaN's (df2).
I'm trying to match the values from df1 to the index of df2, filling with NaN's where such a datetimeindex doesn't exist in df1.
Example:
In [51]: df1
Out [51]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-03-01 00:00:00 6
2015-03-01 01:00:00 37
2015-03-01 02:00:00 56
2015-03-01 03:00:00 12
2015-03-01 04:00:00 41
2015-03-01 05:00:00 31
... ...
2018-12-25 23:00:00 41
<34843 rows × 1 columns>
In [52]: df2 = pd.DataFrame(data=None, index=pd.date_range(freq='60Min', start=df1.index.min(), end=df1.index.max()))
df2['value']=np.NaN
df2
Out [52]: value
2015-01-01 14:00:00 NaN
2015-01-01 15:00:00 NaN
2015-01-01 16:00:00 NaN
2015-01-01 17:00:00 NaN
2015-01-01 18:00:00 NaN
2015-01-01 19:00:00 NaN
2015-01-01 20:00:00 NaN
2015-01-01 21:00:00 NaN
2015-01-01 22:00:00 NaN
2015-01-01 23:00:00 NaN
2015-01-02 00:00:00 NaN
2015-01-02 01:00:00 NaN
2015-01-02 02:00:00 NaN
2015-01-02 03:00:00 NaN
2015-01-02 04:00:00 NaN
2015-01-02 05:00:00 NaN
... ...
2018-12-25 23:00:00 NaN
<34906 rows × 1 columns>
Using df2.combine_first(df1) returns the same data as df1.reindex(index=df2.index): the gaps, where there shouldn't be any data, get filled with some value instead of NaN.
In [53]: Result = df2.combine_first(df1)
Result
Out [53]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-01-02 00:00:00 35
2015-01-02 01:00:00 53
2015-01-02 02:00:00 28
2015-01-02 03:00:00 48
2015-01-02 04:00:00 42
2015-01-02 05:00:00 51
... ...
2018-12-25 23:00:00 41
<34906 rows × 1 columns>
This is what I was hoping to get:
Out [53]: value
2015-01-01 14:00:00 20
2015-01-01 15:00:00 29
2015-01-01 16:00:00 41
2015-01-01 17:00:00 43
2015-01-01 18:00:00 26
2015-01-01 19:00:00 20
2015-01-01 20:00:00 31
2015-01-01 21:00:00 35
2015-01-01 22:00:00 39
2015-01-01 23:00:00 17
2015-01-02 00:00:00 NaN
2015-01-02 01:00:00 NaN
2015-01-02 02:00:00 NaN
2015-01-02 03:00:00 NaN
2015-01-02 04:00:00 NaN
2015-01-02 05:00:00 NaN
... ...
2018-12-25 23:00:00 41
<34906 rows × 1 columns>
Could someone shed some light on why this is happening, and how to set how these values are filled?
IIUC you need to resample df1, because you have an irregular frequency and you need a regular frequency:
print df1.index.freq
None
print Result.index.freq
<60 * Minutes>
EDIT1
You can use the function asfreq instead of resample - see the doc and resample vs asfreq.
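For illustration, a minimal sketch of the asfreq variant (assuming df1 from the question; note that in current pandas resample is lazy and needs an aggregation such as .mean(), whereas asfreq just places the existing values on the regular grid and leaves the new timestamps as NaN):

Result = df1.asfreq('60min')   # same regular 60-minute grid, gaps become NaN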
EDIT2
At first I thought that resample didn't work, because after resampling the Result looked the same as df1. But printing df1.info() and Result.info() gives different results - 34857 entries vs 34920 entries.
So I looked for rows with NaN values, and it returned 63 rows.
So I think resample works well.
import pandas as pd
df1 = pd.read_csv('test/GapInTimestamps.csv', sep=",", index_col=[0], parse_dates=[0])
print df1.head()
# value
#Date/Time
#2015-01-01 00:00:00 52
#2015-01-01 01:00:00 5
#2015-01-01 02:00:00 12
#2015-01-01 03:00:00 54
#2015-01-01 04:00:00 47
print df1.info()
#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 34857 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00
#Data columns (total 1 columns):
#value 34857 non-null int64
#dtypes: int64(1)
#memory usage: 544.6 KB
#None
Result = df1.resample('60min')
print Result.head()
# value
#Date/Time
#2015-01-01 00:00:00 52
#2015-01-01 01:00:00 5
#2015-01-01 02:00:00 12
#2015-01-01 03:00:00 54
#2015-01-01 04:00:00 47
print Result.info()
#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 34920 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00
#Freq: 60T
#Data columns (total 1 columns):
#value 34857 non-null float64
#dtypes: float64(1)
#memory usage: 545.6 KB
#None
#find values with NaN
resultnan = Result[Result.isnull().any(axis=1)]
#temporaly display 999 rows and 15 columns
with pd.option_context('display.max_rows', 999, 'display.max_columns', 15):
    print resultnan
# value
#Date/Time
#2015-01-13 19:00:00 NaN
#2015-01-13 20:00:00 NaN
#2015-01-13 21:00:00 NaN
#2015-01-13 22:00:00 NaN
#2015-01-13 23:00:00 NaN
#2015-01-14 00:00:00 NaN
#2015-01-14 01:00:00 NaN
#2015-01-14 02:00:00 NaN
#2015-01-14 03:00:00 NaN
#2015-01-14 04:00:00 NaN
#2015-01-14 05:00:00 NaN
#2015-01-14 06:00:00 NaN
#2015-01-14 07:00:00 NaN
#2015-01-14 08:00:00 NaN
#2015-01-14 09:00:00 NaN
#2015-02-01 00:00:00 NaN
#2015-02-01 01:00:00 NaN
#2015-02-01 02:00:00 NaN
#2015-02-01 03:00:00 NaN
#2015-02-01 04:00:00 NaN
#2015-02-01 05:00:00 NaN
#2015-02-01 06:00:00 NaN
#2015-02-01 07:00:00 NaN
#2015-02-01 08:00:00 NaN
#2015-02-01 09:00:00 NaN
#2015-02-01 10:00:00 NaN
#2015-02-01 11:00:00 NaN
#2015-02-01 12:00:00 NaN
#2015-02-01 13:00:00 NaN
#2015-02-01 14:00:00 NaN
#2015-02-01 15:00:00 NaN
#2015-02-01 16:00:00 NaN
#2015-02-01 17:00:00 NaN
#2015-02-01 18:00:00 NaN
#2015-02-01 19:00:00 NaN
#2015-02-01 20:00:00 NaN
#2015-02-01 21:00:00 NaN
#2015-02-01 22:00:00 NaN
#2015-02-01 23:00:00 NaN
#2015-11-01 00:00:00 NaN
#2015-11-01 01:00:00 NaN
#2015-11-01 02:00:00 NaN
#2015-11-01 03:00:00 NaN
#2015-11-01 04:00:00 NaN
#2015-11-01 05:00:00 NaN
#2015-11-01 06:00:00 NaN
#2015-11-01 07:00:00 NaN
#2015-11-01 08:00:00 NaN
#2015-11-01 09:00:00 NaN
#2015-11-01 10:00:00 NaN
#2015-11-01 11:00:00 NaN
#2015-11-01 12:00:00 NaN
#2015-11-01 13:00:00 NaN
#2015-11-01 14:00:00 NaN
#2015-11-01 15:00:00 NaN
#2015-11-01 16:00:00 NaN
#2015-11-01 17:00:00 NaN
#2015-11-01 18:00:00 NaN
#2015-11-01 19:00:00 NaN
#2015-11-01 20:00:00 NaN
#2015-11-01 21:00:00 NaN
#2015-11-01 22:00:00 NaN
#2015-11-01 23:00:00 NaN

Python: Delete specific timestamp index row (independently of date)

I have a DataFrame with this specific timestamps index:
2011-01-07 09:30:00
2011-01-07 09:35:00
2011-01-07 09:40:00
...
2011-01-08 09:30:00
2011-01-08 09:35:00
2011-01-08 09:40:00
...
2011-01-09 09:30:00
2011-01-09 09:35:00
2011-01-09 09:40:00
Without going through some kind of loop, is there a fast way to delete every row with the time 09:30:00 independently of the date?
Construct a test frame
In [28]: df = DataFrame(np.random.randn(400,1),index=date_range('20130101',periods=400,freq='15T'))
In [29]: df = df.take(df.index.indexer_between_time('9:00','10:00'))
In [30]: df
Out[30]:
0
2013-01-01 09:00:00 -1.452507
2013-01-01 09:15:00 -0.244847
2013-01-01 09:30:00 -0.654370
2013-01-01 09:45:00 -0.689975
2013-01-01 10:00:00 -1.506261
2013-01-02 09:00:00 -0.096923
2013-01-02 09:15:00 -1.371506
2013-01-02 09:30:00 1.481053
2013-01-02 09:45:00 0.327030
2013-01-02 10:00:00 1.614000
2013-01-03 09:00:00 -1.313668
2013-01-03 09:15:00 0.563914
2013-01-03 09:30:00 -0.117773
2013-01-03 09:45:00 0.309642
2013-01-03 10:00:00 -0.386824
2013-01-04 09:00:00 -1.245194
2013-01-04 09:15:00 0.930746
2013-01-04 09:30:00 1.088279
2013-01-04 09:45:00 -0.927087
2013-01-04 10:00:00 -1.098625
[20 rows x 1 columns]
indexer_between_time returns the positions of the rows we want to remove; take turns them into index entries, which we then remove from the original index (this is what subtracting one index from another does).
In [31]: df.reindex(df.index-df.index.take(df.index.indexer_between_time('9:30:00','9:30:00')))
Out[31]:
0
2013-01-01 09:00:00 -1.452507
2013-01-01 09:15:00 -0.244847
2013-01-01 09:45:00 -0.689975
2013-01-01 10:00:00 -1.506261
2013-01-02 09:00:00 -0.096923
2013-01-02 09:15:00 -1.371506
2013-01-02 09:45:00 0.327030
2013-01-02 10:00:00 1.614000
2013-01-03 09:00:00 -1.313668
2013-01-03 09:15:00 0.563914
2013-01-03 09:45:00 0.309642
2013-01-03 10:00:00 -0.386824
2013-01-04 09:00:00 -1.245194
2013-01-04 09:15:00 0.930746
2013-01-04 09:45:00 -0.927087
2013-01-04 10:00:00 -1.098625
[16 rows x 1 columns]
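As a side note for newer pandas versions, where the - set operation on an index has been removed, the same idea can be written with Index.difference or drop (a sketch on the df above):

drop_idx = df.index.take(df.index.indexer_between_time('9:30:00', '9:30:00'))
df.reindex(df.index.difference(drop_idx))
# or, equivalently
df.drop(drop_idx)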
You need to do something like this:
>>> x = pd.DataFrame([[1,2,3,4],[3,3,3,3],[8,7,3,2],[9,9,9,4],[2,2,2,4]])
>>> x
0 1 2 3
0 1 2 3 4
1 3 3 3 3
2 8 7 3 2
3 9 9 9 4
4 2 2 2 4
[5 rows x 4 columns]
>>> x[x[3] == 4]
0 1 2 3
0 1 2 3 4
3 9 9 9 4
4 2 2 2 4
[3 rows x 4 columns]
In your case the condition would be on the timestamp column. x[x[3] == 4] means: get only those rows for which column '3' has the value 4.
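Applied to the question's DatetimeIndex, one way to express that condition (a sketch, assuming the 09:30:00 rows are the ones to drop) is a boolean mask on the index's time component:

import datetime

df = df[df.index.time != datetime.time(9, 30)]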

Indexing pandas dataframe to return first data point from each day

In Pandas, I have a DataFrame with datetimes in a column (not the index), which span several days, and are at irregular time intervals (i.e. not periodic). I want to return the first value from each day. So if my datetime column looked like:
2013-01-01 01:00
2013-01-01 05:00
2013-01-01 14:00
2013-01-02 01:00
2013-01-02 05:00
2013-01-04 14:00
The command I'm looking for would return the dataframe rows for the following datetimes:
2013-01-01 01:00
2013-01-02 01:00
2013-01-04 14:00
With this setup:
import pandas as pd
data = '''\
2013-01-01 01:00
2013-01-01 05:00
2013-01-01 14:00
2013-01-02 01:00
2013-01-02 05:00
2013-01-04 14:00'''
dates = pd.to_datetime(data.splitlines())
df = pd.DataFrame({'date': dates, 'val': range(len(dates))})
>>> df
date val
0 2013-01-01 01:00:00 0
1 2013-01-01 05:00:00 1
2 2013-01-01 14:00:00 2
3 2013-01-02 01:00:00 3
4 2013-01-02 05:00:00 4
5 2013-01-04 14:00:00 5
You can produce the desired DataFrame using groupby and agg:
grouped = df.groupby([d.strftime('%Y%m%d') for d in df['date']])
newdf = grouped.agg('first')
print(newdf)
yields
date val
20130101 2013-01-01 01:00:00 0
20130102 2013-01-02 01:00:00 3
20130104 2013-01-04 14:00:00 5
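A slightly more direct variant of the same groupby (a sketch on the df built above) keys on the datetime's date component instead of a formatted string:

newdf = df.groupby(df['date'].dt.date).first()
print(newdf)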
