Python: Delete specific timestamp index rows (independently of date)

I have a DataFrame with this timestamp index:
2011-01-07 09:30:00
2011-01-07 09:35:00
2011-01-07 09:40:00
...
2011-01-08 09:30:00
2011-01-08 09:35:00
2011-01-08 09:40:00
...
2011-01-09 09:30:00
2011-01-09 09:35:00
2011-01-09 09:40:00
Without going through some kind of loop, is there a fast way to delete every row with the time 09:30:00 independently of the date?

Construct a test frame
In [28]: df = DataFrame(np.random.randn(400,1),index=date_range('20130101',periods=400,freq='15T'))
In [29]: df = df.take(df.index.indexer_between_time('9:00','10:00'))
In [30]: df
Out[30]:
0
2013-01-01 09:00:00 -1.452507
2013-01-01 09:15:00 -0.244847
2013-01-01 09:30:00 -0.654370
2013-01-01 09:45:00 -0.689975
2013-01-01 10:00:00 -1.506261
2013-01-02 09:00:00 -0.096923
2013-01-02 09:15:00 -1.371506
2013-01-02 09:30:00 1.481053
2013-01-02 09:45:00 0.327030
2013-01-02 10:00:00 1.614000
2013-01-03 09:00:00 -1.313668
2013-01-03 09:15:00 0.563914
2013-01-03 09:30:00 -0.117773
2013-01-03 09:45:00 0.309642
2013-01-03 10:00:00 -0.386824
2013-01-04 09:00:00 -1.245194
2013-01-04 09:15:00 0.930746
2013-01-04 09:30:00 1.088279
2013-01-04 09:45:00 -0.927087
2013-01-04 10:00:00 -1.098625
[20 rows x 1 columns]
indexer_between_time returns the positions of the rows we want to remove, so just remove those labels from the original index (a set difference, which is what subtracting one index from another does).
In [31]: df.reindex(df.index-df.index.take(df.index.indexer_between_time('9:30:00','9:30:00')))
Out[31]:
0
2013-01-01 09:00:00 -1.452507
2013-01-01 09:15:00 -0.244847
2013-01-01 09:45:00 -0.689975
2013-01-01 10:00:00 -1.506261
2013-01-02 09:00:00 -0.096923
2013-01-02 09:15:00 -1.371506
2013-01-02 09:45:00 0.327030
2013-01-02 10:00:00 1.614000
2013-01-03 09:00:00 -1.313668
2013-01-03 09:15:00 0.563914
2013-01-03 09:45:00 0.309642
2013-01-03 10:00:00 -0.386824
2013-01-04 09:00:00 -1.245194
2013-01-04 09:15:00 0.930746
2013-01-04 09:45:00 -0.927087
2013-01-04 10:00:00 -1.098625
[16 rows x 1 columns]
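Note that subtracting one index from another with - was removed in later pandas versions. A minimal sketch of the same operation on a current version (assuming a unique DatetimeIndex) would be:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(400, 1),
                  index=pd.date_range('20130101', periods=400, freq='15min'))
df = df.take(df.index.indexer_between_time('9:00', '10:00'))

# Boolean mask on the index's time-of-day component
df_no_930 = df[df.index.time != pd.Timestamp('09:30').time()]

# Or, closest to the answer above, Index.difference instead of `-`
rows_930 = df.index[df.index.indexer_between_time('9:30', '9:30')]
df_no_930_alt = df.reindex(df.index.difference(rows_930))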

You need to do something like this:
>>> x = pd.DataFrame([[1,2,3,4],[3,3,3,3],[8,7,3,2],[9,9,9,4],[2,2,2,4]])
>>> x
0 1 2 3
0 1 2 3 4
1 3 3 3 3
2 8 7 3 2
3 9 9 9 4
4 2 2 2 4
[5 rows x 4 columns]
>>> x[x[3] == 4]
0 1 2 3
0 1 2 3 4
3 9 9 9 4
4 2 2 2 4
[3 rows x 4 columns]
In your case the condition would be on the timestamp. x[x[3] == 4] means: keep only those rows for which column 3 has the value 4.
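Applied to the question, where the timestamps are the index rather than a column, a minimal sketch (assuming a DatetimeIndex) is to select the 09:30 rows with between_time and drop them by label:

import pandas as pd

idx = pd.to_datetime(['2011-01-07 09:30:00', '2011-01-07 09:35:00',
                      '2011-01-08 09:30:00', '2011-01-08 09:35:00'])
df = pd.DataFrame({'val': range(4)}, index=idx)

# between_time with identical endpoints selects exactly the 09:30:00 rows
df = df.drop(df.between_time('09:30', '09:30').index)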

Related

Select sub_Dataframe with positive element and return index in sub_List

I have a dataframe that looks like this:
A
2013-01-05 00:00:00 0
2013-01-05 01:00:00 0
2013-01-05 02:00:00 5
2013-01-05 03:00:00 20
2013-01-05 04:00:00 10
2013-01-05 05:00:00 0
2013-01-05 06:00:00 0
2013-01-05 07:00:00 3
2013-01-05 08:00:00 6
I tried to select the sub-dataframes with positive values and extract their indexes:
List = df[df['A'] > 0].index.tolist()
What I want is the index of the first and last positive element of each sub-dataframe, with each pair put in a sub-list: for this dataframe the positive runs are [[5, 10], [3, 6]], and I want their indexes.
Desired output: [[2013-01-05 02:00:00, 2013-01-05 04:00:00], [2013-01-05 07:00:00, 2013-01-05 08:00:00]]
You could try:
idx_list = (
    df
    .assign(
        group=df["A"].gt(0).diff().fillna(False).cumsum(), idx=df.index
    )[df["A"].gt(0)]
    .groupby("group").agg({"idx": lambda col: [col.iat[0], col.iat[-1]]})
    .idx.to_list()
)
Result with
df =
A
2013-01-05 00:00:00 0
2013-01-05 01:00:00 0
2013-01-05 02:00:00 5
2013-01-05 03:00:00 20
2013-01-05 04:00:00 10
2013-01-05 05:00:00 0
2013-01-05 06:00:00 0
2013-01-05 07:00:00 3
2013-01-05 08:00:00 6
is
[[Timestamp('2013-01-05 02:00:00'), Timestamp('2013-01-05 04:00:00')],
[Timestamp('2013-01-05 07:00:00'), Timestamp('2013-01-05 08:00:00')]]
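For intuition, a rough trace of the intermediate group column on this frame:

# df["A"].gt(0)            : F F T T T F F T T
# .diff()                  : NaN F T F F T F T F   (True at each sign flip)
# .fillna(False).cumsum()  : 0 0 1 1 1 2 2 3 3
# After the df["A"].gt(0) filter, group 1 holds 02:00-04:00 and group 3
# holds 07:00-08:00, so the agg picks the first and last index of each run.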
Or, to get the values instead of the timestamps, filter on the column name and extract the first and last elements from the list:
my_list = df[df['A'] > 0]['A'].to_list()
my_list = [my_list[0], my_list[-1]]

Reverse position of entries in pandas dataframe based on condition

Here I have an extract from my pandas dataframe, which is survey data with two datetime fields. It appears that some of the start and end times were filled in in the wrong positions in the survey. Here is an example from my dataframe: the start and end times in the 8th row, I suspect, were entered the wrong way round.
Just to give context, I generated the third column like this:
df_time['trip_duration'] = df_time['tripEnd_time'] - df_time['tripStart_time']
The three columns are in timedelta64 format.
Here is the top of my dataframe:
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 -1 days +22:15:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
What I am trying to do is loop through these two columns and, each time 'tripEnd_time' is less than 'tripStart_time', swap the positions of the two entries. So in the case of row 8 above, I would make tripStart_time = tripEnd_time and tripEnd_time = tripStart_time.
I am not quite sure of the best way to approach this. Should I use a nested for loop where I compare each entry in the two columns?
Thanks
Use Series.abs:
df_time['trip_duration'] = (df_time['tripEnd_time'] - df_time['tripStart_time']).abs()
print (df_time)
  tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 01:45:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
Which is the same as:
import numpy as np

a = df_time['tripEnd_time'] - df_time['tripStart_time']
b = df_time['tripStart_time'] - df_time['tripEnd_time']
mask = df_time['tripEnd_time'] > df_time['tripStart_time']
df_time['trip_duration'] = np.where(mask, a, b)
print (df_time)
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 01:45:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
You can switch column values on selected rows:
df_time.loc[df_time['tripEnd_time'] < df_time['tripStart_time'],
            ['tripStart_time', 'tripEnd_time']] = df_time.loc[
    df_time['tripEnd_time'] < df_time['tripStart_time'],
    ['tripEnd_time', 'tripStart_time']].values
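Putting the two steps together, a minimal sketch (assuming the columns hold timedeltas, as in the question): swap the misordered pairs first, and the recomputed duration then needs no abs():

import pandas as pd

df_time = pd.DataFrame({
    'tripStart_time': pd.to_timedelta(['22:30:00', '11:00:00']),
    'tripEnd_time':   pd.to_timedelta(['23:15:00', '09:15:00']),
})

mask = df_time['tripEnd_time'] < df_time['tripStart_time']
# Swap start/end wherever the end precedes the start
df_time.loc[mask, ['tripStart_time', 'tripEnd_time']] = (
    df_time.loc[mask, ['tripEnd_time', 'tripStart_time']].values)
# The plain difference is now always non-negative
df_time['trip_duration'] = df_time['tripEnd_time'] - df_time['tripStart_time']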

Python pandas time series w/ hierarchical indexing and rolling/ shifting

Struggling with pandas' rolling and shifting concepts. There are many good suggestions, including on this forum, but I have failed miserably to apply them to my scenario.
For now I loop over the time series the traditional way, but ugh, it took about 8 hours to iterate over 150,000 rows, which is roughly 3 days of data for all the tickers. With 2 months of data to process, it probably won't finish before I come back from a sabbatical, not to mention the risk of a power cut, after which I'd have to start over again, this time with no sabbatical to wait out.
I have the following 15 min stock price time series (hierarchical index on datetime (timestamp) and ticker; the only original column is closePrice):
closePrice
datetime ticker
2014-02-04 09:15:00 AAPL xxx
EQIX xxx
FB xxx
GOOG xxx
MSFT xxx
2014-02-04 09:30:00 AAPL xxx
EQIX xxx
FB xxx
GOOG xxx
MSFT xxx
2014-02-04 09:45:00 AAPL xxx
EQIX xxx
FB xxx
GOOG xxx
MSFT xxx
I need to add two columns:
12sma, a 12-period simple moving average. Having searched SO for hours, the best suggestion seemed to be rolling_mean, so I tried it. But it didn't work given my TS structure: it works top-down, so the first MA is calculated from the first 12 rows regardless of their ticker values. How do I make it average based on the index, i.e. first datetime, then ticker, so that I get the MA for, say, AAPL? Currently it does (AAPL+EQIX+FB+GOOG+MSFT+AAPL+... up to the 12th row) / 12.
Once I have the 12sma column, I need a 12ema column, a 12-period exponential MA. For the calculation, the first value in the time series for each ticker simply copies the 12sma value from the same row. Each subsequent value needs the closePrice from its own row and the 12ema from the previous row, i.e. from 15 minutes earlier. After a lot of research, it seems the solution is some combination of rolling and shifting, but I can't figure out how to put it together.
I'd be grateful for any help.
Thanks.
EDIT:
Thanks to Jeff's tips, after swapping and sorting the index levels I am able to get 12sma right with rolling_mean(), and with an effort managed to insert the first 12ema value, copied from 12sma at the same timestamp:
close 12sma 12ema
sec_code datetime
AAPL 2014-02-05 11:45:00 113.0 NaN NaN
2014-02-05 12:00:00 113.2 NaN NaN
2014-02-05 13:15:00 112.9 NaN NaN
2014-02-05 13:30:00 113.2 NaN NaN
2014-02-05 13:45:00 113.0 NaN NaN
2014-02-05 14:00:00 113.1 NaN NaN
2014-02-05 14:15:00 113.3 NaN NaN
2014-02-05 14:30:00 113.3 NaN NaN
2014-02-05 14:45:00 113.3 NaN NaN
2014-02-05 15:00:00 113.2 NaN NaN
2014-02-05 15:15:00 113.2 NaN NaN
2014-02-05 15:30:00 113.3 113.16 113.16
2014-02-05 15:45:00 113.3 113.19 NaN
2014-02-05 16:00:00 113.2 113.19 NaN
2014-02-06 09:45:00 112.6 113.16 NaN
2014-02-06 10:00:00 113.5 113.19 NaN
2014-02-06 10:15:00 113.8 113.25 NaN
2014-02-06 10:30:00 113.5 113.29 NaN
2014-02-06 10:45:00 113.7 113.32 NaN
2014-02-06 11:00:00 113.5 113.34 NaN
I understand pandas has pandas.stats.moments.ewma, but I prefer to use a formula I got from a book, which needs the close price 'at the moment' and the 12ema from the previous row.
So I tried to fill the 12ema column from Feb 5, 15:45 onward. I tried apply() with a function, but shift gave an error:
def f12ema(x):
    K = 2 / (12 + 1)
    return x['price_nom'] * K + x['12ema'].shift(-1) * (1-K)

df1.apply(f12ema, axis=1)
AttributeError: ("'numpy.float64' object has no attribute 'shift'", u'occurred at index 2014-02-05 11:45:00')
Another possibility that crossed my mind is rolling_apply(), but it is beyond my knowledge.
Create a date range inclusive of the times you want
In [47]: rng = date_range('20130102 09:30:00','20130105 16:00:00',freq='15T')
In [48]: rng = rng.take(rng.indexer_between_time('09:30:00','16:00:00'))
In [49]: rng
Out[49]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-02 09:30:00, ..., 2013-01-05 16:00:00]
Length: 108, Freq: None, Timezone: None
Create a frame similar to yours (2000 tickers x dates)
In [50]: df = DataFrame(np.random.randn(len(rng)*2000,1),columns=['close'],index=MultiIndex.from_product([rng,range(2000)],names=['date','ticker']))
Reorder the levels so that the index is ticker x date, and SORT IT!!!!
In [51]: df = df.swaplevel('ticker','date').sortlevel()
In [52]: df
Out[52]:
close
ticker date
0 2013-01-02 09:30:00 0.840767
2013-01-02 09:45:00 1.808101
2013-01-02 10:00:00 -0.718354
2013-01-02 10:15:00 -0.484502
2013-01-02 10:30:00 0.563363
2013-01-02 10:45:00 0.553920
2013-01-02 11:00:00 1.266992
2013-01-02 11:15:00 -0.641117
2013-01-02 11:30:00 -0.574673
2013-01-02 11:45:00 0.861825
2013-01-02 12:00:00 -1.562111
2013-01-02 12:15:00 -0.072966
2013-01-02 12:30:00 0.673079
2013-01-02 12:45:00 0.766105
2013-01-02 13:00:00 0.086202
2013-01-02 13:15:00 0.949205
2013-01-02 13:30:00 -0.381194
2013-01-02 13:45:00 0.316813
2013-01-02 14:00:00 -0.620176
2013-01-02 14:15:00 -0.193126
2013-01-02 14:30:00 -1.552111
2013-01-02 14:45:00 1.724429
2013-01-02 15:00:00 -0.092393
2013-01-02 15:15:00 0.197763
2013-01-02 15:30:00 0.064541
2013-01-02 15:45:00 -1.574853
2013-01-02 16:00:00 -1.023979
2013-01-03 09:30:00 -0.079349
2013-01-03 09:45:00 -0.749285
2013-01-03 10:00:00 0.652721
2013-01-03 10:15:00 -0.818152
2013-01-03 10:30:00 0.674068
2013-01-03 10:45:00 2.302714
2013-01-03 11:00:00 0.614686
...
[216000 rows x 1 columns]
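A side note for current pandas: sortlevel() has since been removed, so the reorder-and-sort step is now spelled with sort_index(). A hedged port of the setup:

import numpy as np
import pandas as pd

rng = pd.date_range('20130102 09:30:00', '20130105 16:00:00', freq='15min')
rng = rng.take(rng.indexer_between_time('09:30:00', '16:00:00'))
df = pd.DataFrame(np.random.randn(len(rng) * 2000, 1), columns=['close'],
                  index=pd.MultiIndex.from_product([rng, range(2000)],
                                                   names=['date', 'ticker']))
# sortlevel() is gone; sort_index() sorts the swapped MultiIndex
df = df.swaplevel('ticker', 'date').sort_index()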
Groupby the ticker. Return a DataFrame for each ticker that is the application of rolling_mean and ewma. Note that there are many options for controlling this, e.g. windowing; you could make it not wrap around days, etc.
In [53]: df.groupby(level='ticker')['close'].apply(lambda x: concat({ 'spma' : pd.rolling_mean(x,3), 'ema' : pd.ewma(x,3) }, axis=1))
Out[53]:
ema spma
ticker date
0 2013-01-02 09:30:00 0.840767 NaN
2013-01-02 09:45:00 1.393529 NaN
2013-01-02 10:00:00 0.480282 0.643504
2013-01-02 10:15:00 0.127447 0.201748
2013-01-02 10:30:00 0.270334 -0.213164
2013-01-02 10:45:00 0.356580 0.210927
2013-01-02 11:00:00 0.619245 0.794758
2013-01-02 11:15:00 0.269100 0.393265
2013-01-02 11:30:00 0.041032 0.017067
2013-01-02 11:45:00 0.258476 -0.117988
2013-01-02 12:00:00 -0.216742 -0.424986
2013-01-02 12:15:00 -0.179622 -0.257750
2013-01-02 12:30:00 0.038741 -0.320666
2013-01-02 12:45:00 0.223881 0.455406
2013-01-02 13:00:00 0.188995 0.508462
2013-01-02 13:15:00 0.380972 0.600504
2013-01-02 13:30:00 0.188987 0.218071
2013-01-02 13:45:00 0.221125 0.294942
2013-01-02 14:00:00 0.009907 -0.228185
2013-01-02 14:15:00 -0.041013 -0.165496
2013-01-02 14:30:00 -0.419688 -0.788471
2013-01-02 14:45:00 0.117299 -0.006936
2013-01-04 10:00:00 -0.060415 0.341013
2013-01-04 10:15:00 0.074068 0.604611
2013-01-04 10:30:00 -0.108502 0.440256
2013-01-04 10:45:00 -0.514229 -0.636702
... ...
[216000 rows x 2 columns]
Pretty good perf, as it's essentially looping over the tickers.
In [54]: %timeit df.groupby(level='ticker')['close'].apply(lambda x: concat({ 'spma' : pd.rolling_mean(x,3), 'ema' : pd.ewma(x,3) }, axis=1))
1 loops, best of 3: 2.1 s per loop
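On current pandas, pd.rolling_mean and pd.ewma no longer exist; a hedged equivalent of the groupby step uses the method-style API (the positional 3 in pd.ewma(x, 3) was the center of mass, hence com=3 here):

# With df as constructed above (ticker x date MultiIndex)
out = df.groupby(level='ticker', group_keys=False)['close'].apply(
    lambda x: pd.concat({'spma': x.rolling(3).mean(),
                         'ema': x.ewm(com=3).mean()}, axis=1))

For the asker's book formula, x.ewm(span=12, adjust=False).mean() implements the same recursion, ema_t = K * close_t + (1 - K) * ema_{t-1} with K = 2 / (12 + 1), though pandas seeds it with the first observation rather than the 12-period SMA.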

Temporal Binning in Pandas

I would like to perform something similar to an SQL groupby operation or R's aggregate in Pandas. I have a bunch of rows with irregular timestamps, and I would like to create temporal bins and count the number of rows falling into each bin. I can't quite see how to use resample to do this.
Example Rows
Time, Val
05.33, XYZ
05.45, ABC
07.13, DEF
Example Output
05.00-06.00, 2
06.00-07.00, 0
07.00-08.00, 1
If you are indexing on another value, you can use a groupby statement on the timestamp.
In [1]: dft = pd.DataFrame({'A' : ['spam', 'eggs', 'spam', 'eggs'] * 6,
                            'B' : np.random.randn(24),
                            'C' : [np.random.choice(pd.date_range(datetime.datetime(2013,1,1,0,0,0), datetime.datetime(2013,1,2,0,0,0), freq='T')) for i in range(24)]})
In [2]: dft['B'].groupby([dft['C'].apply(lambda x:x.hour)]).agg(pd.Series.nunique)
Out[2]:
C
2 1
4 1
6 1
7 1
9 1
10 2
11 1
12 4
14 1
15 2
16 1
18 3
19 1
20 1
21 1
22 1
23 1
dtype: float64
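If all you want is the row count per hourly bin, as in the question's example output, size() on the same grouping is enough (a sketch reusing dft; note that, unlike resample below, groupby emits no row for empty bins such as 06.00-07.00):

counts = dft.groupby(dft['C'].dt.hour).size()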
If you're indexing on timestamps, then you can use resample.
In [3]: dft2 = pd.DataFrame({'A' : ['spam', 'eggs', 'spam', 'eggs'] * 6,
                             'B' : np.random.randn(24)},
                            index = [np.random.choice(pd.date_range(datetime.datetime(2013,1,1,0,0,0), datetime.datetime(2013,1,2,0,0,0), freq='T')) for i in range(24)])
In [4]: dft2.resample('H',how=pd.Series.nunique)
Out[4]:
A B
2013-01-01 01:00:00 1 1
2013-01-01 02:00:00 0 0
2013-01-01 03:00:00 0 0
2013-01-01 04:00:00 0 0
2013-01-01 05:00:00 2 2
2013-01-01 06:00:00 2 3
2013-01-01 07:00:00 1 2
2013-01-01 08:00:00 2 2
2013-01-01 09:00:00 1 1
2013-01-01 10:00:00 2 3
2013-01-01 11:00:00 1 1
2013-01-01 12:00:00 1 2
2013-01-01 13:00:00 0 0
2013-01-01 14:00:00 1 1
2013-01-01 15:00:00 0 0
2013-01-01 16:00:00 1 1
2013-01-01 17:00:00 1 2
2013-01-01 18:00:00 0 0
2013-01-01 19:00:00 0 0
2013-01-01 20:00:00 2 2
2013-01-01 21:00:00 1 1
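The how= keyword of resample was later removed; on current pandas the same call is spelled with a method chain. A hedged equivalent ('H' rather than 'h' on older versions):

import datetime
import numpy as np
import pandas as pd

rng = pd.date_range(datetime.datetime(2013, 1, 1), datetime.datetime(2013, 1, 2), freq='min')
dft2 = pd.DataFrame({'A': ['spam', 'eggs', 'spam', 'eggs'] * 6,
                     'B': np.random.randn(24)},
                    index=np.random.choice(rng, 24)).sort_index()

dft2.resample('h').agg(pd.Series.nunique)  # as in Out[4] above
dft2.resample('h').size()                  # plain row counts per hourly bin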

Indexing pandas dataframe to return first data point from each day

In Pandas, I have a DataFrame with datetimes in a column (not the index), which span several days and are at irregular time intervals (i.e. not periodic). I want to return the first value from each day. So if my datetime column looked like:
2013-01-01 01:00
2013-01-01 05:00
2013-01-01 14:00
2013-01-02 01:00
2013-01-02 05:00
2013-01-04 14:00
The command I'm looking for would return the dataframe rows for the following indexes:
2013-01-01 01:00
2013-01-02 01:00
2013-01-04 14:00
With this setup:
import pandas as pd
data = '''\
2013-01-01 01:00
2013-01-01 05:00
2013-01-01 14:00
2013-01-02 01:00
2013-01-02 05:00
2013-01-04 14:00'''
dates = pd.to_datetime(data.splitlines())
df = pd.DataFrame({'date': dates, 'val': range(len(dates))})
>>> df
date val
0 2013-01-01 01:00:00 0
1 2013-01-01 05:00:00 1
2 2013-01-01 14:00:00 2
3 2013-01-02 01:00:00 3
4 2013-01-02 05:00:00 4
5 2013-01-04 14:00:00 5
You can produce the desired DataFrame using groupby and agg:
grouped = df.groupby([d.strftime('%Y%m%d') for d in df['date']])
newdf = grouped.agg('first')
print(newdf)
yields
date val
20130101 2013-01-01 01:00:00 0
20130102 2013-01-02 01:00:00 3
20130104 2013-01-04 14:00:00 5
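On current pandas the strftime list comprehension can be replaced by grouping on the date directly; a hedged sketch using the same df:

# Group on the calendar date and keep the first row of each day
newdf = df.groupby(df['date'].dt.date).first()

# Or keep the row with the earliest timestamp per day, robust even if
# the frame isn't sorted by time
first_per_day = df.loc[df.groupby(df['date'].dt.date)['date'].idxmin()]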
