How to index out open and close in pandas datetime dataframe? - python

Okay so I have a CSV with minute data for the S&P 500 index for 2020, and I am looking to index out only the open and close for 9:30 and 4:00 pm. In essence I just want what the market open and close was. So far the code is:
import pandas as pd
import datetime as dt
import numpy as np
d = pd.read_csv('/Volumes/Seagate Portable/usindex_2020_all_tickers_awvbxk9/SPX_2020_2020.txt')
d.columns = ['Dates', 'Open', 'High', 'Low', 'Close']
d.drop(['High', 'Low'], axis=1, inplace=True)
d.set_index('Dates', inplace=True)
d.head()
It won't let me share the CSV file, but this is what the output looks like:
Open Close
Dates
2020-01-02 09:31:00 3247.19 3245.22
2020-01-02 09:32:00 3245.07 3244.66
2020-01-02 09:33:00 3244.89 3247.61
2020-01-02 09:34:00 3247.38 3246.92
2020-01-02 09:35:00 3246.89 3249.09
I have tried using loc and dt.time, which I am assuming is the right way to code it; I just cannot think of the exact code to index out these two times. Any ideas? Thank you!

If the .dt accessor is used on the 'Dates' column (d.Dates.dt.time[0]), the time component is datetime.time(9, 30). Therefore d.Dates.dt.time == dtime(9, 30) must be used for the Boolean match, not d.Dates.dt.time == '09:30:00'.
import pandas as pd
from datetime import time as dtime
# test dataframe
d = pd.DataFrame({
    'Dates': ['2020-01-02 09:30:00', '2020-01-02 09:31:00', '2020-01-02 09:32:00',
              '2020-01-02 09:33:00', '2020-01-02 09:34:00', '2020-01-02 09:35:00',
              '2020-01-02 16:00:00'],
    'Open': [3247.19, 3247.19, 3245.07, 3244.89, 3247.38, 3246.89, 3247.19],
    'Close': [3245.22, 3245.22, 3244.66, 3247.61, 3246.92, 3249.09, 3245.22]})
# display(d)
Dates Open Close
0 2020-01-02 09:30:00 3247.19 3245.22
1 2020-01-02 09:31:00 3247.19 3245.22
2 2020-01-02 09:32:00 3245.07 3244.66
3 2020-01-02 09:33:00 3244.89 3247.61
4 2020-01-02 09:34:00 3247.38 3246.92
5 2020-01-02 09:35:00 3246.89 3249.09
6 2020-01-02 16:00:00 3247.19 3245.22
# ensure 'Dates' is a datetime dtype
d.Dates = pd.to_datetime(d.Dates)
# use Boolean selection for 9:30 and 16:00 (4pm)
d = d[(d.Dates.dt.time == dtime(9, 30)) | (d.Dates.dt.time == dtime(16, 0))].copy()
# set the index
d.set_index('Dates', inplace=True)
# display(d)
Open Close
Dates
2020-01-02 09:30:00 3247.19 3245.22
2020-01-02 16:00:00 3247.19 3245.22
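The same Boolean match can be written a little more compactly with Series.isin (a sketch of an equivalent variant, applied before the set_index step):
# equivalent selection: keep rows whose time-of-day is 09:30 or 16:00
d = d[d.Dates.dt.time.isin([dtime(9, 30), dtime(16, 0)])].copy()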

Try:
import pandas as pd
# create dummy daterange
date_range = pd.DatetimeIndex(pd.date_range("00:00", "23:59", freq='1min'))
# create df with enumerated column as data, and with daterange(DatetimeIndex) as index
df = pd.DataFrame(data=[i for i, d in enumerate(date_range)], index=date_range)
# boolean index using strings
four_and_nine = df[(df.index == '16:00:00') | (df.index == '21:00:00')]
print(four_and_nine)
0
2021-01-01 16:00:00 960
2021-01-01 21:00:00 1260
Pandas is pretty smart about comparing strings to actual datetimes (a DatetimeIndex in this case).
The above selects the top of the hour. If you wanted all minutes/seconds within specific hours, use a boolean index like: df[(df.index.hour == 16) | (df.index.hour == 9)] (note that 4 pm is hour 16 in 24-hour time).
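pandas also has dedicated time-of-day selectors for a DatetimeIndex, DataFrame.at_time and DataFrame.between_time, which avoid the manual comparison entirely. A minimal sketch on the dummy frame above:
# rows at exactly 09:30 and 16:00
opens = df.at_time('09:30')
closes = df.at_time('16:00')
both = pd.concat([opens, closes]).sort_index()
# or every minute between the open and the close, inclusive
session = df.between_time('09:30', '16:00')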

Related

Pandas: how to map one dataframe onto another with missing data?

I'm sure an easy command exists to do this in pandas, but for the life of me I can't figure it out.
I have two dataframes. The first is an ideal stock market timeline (times where I expect data to exist). The second dataframe is the actual data, with gaps. I need to map one to the other, and fill in the gaps with NaN.
First DataFrame: (an ideal timeline)
datetime
0 2005-01-03 10:00:00
1 2005-01-03 11:00:00
2 2005-01-03 12:00:00
3 2005-01-03 13:00:00
4 2005-01-03 14:00:00
Second DataFrame: (actual data with missing value at time 12:00:00)
datetime open high low close volume
1 2005-01-03 10:00:00 15.1118 15.1745 14.7478 14.8294 586463
2 2005-01-03 11:00:00 14.8294 14.9737 14.7792 14.9423 344888
3 2005-01-03 13:00:00 15.0490 15.0929 14.9549 14.9612 343767
4 2005-01-03 14:00:00 14.9674 15.0616 14.9674 15.0051 364739
I want the finished product to be:
datetime open high low close volume
1 2005-01-03 10:00:00 15.1118 15.1745 14.7478 14.8294 586463
2 2005-01-03 11:00:00 14.8294 14.9737 14.7792 14.9423 344888
3 2005-01-03 12:00:00 NaN NaN NaN NaN NaN
4 2005-01-03 13:00:00 15.0490 15.0929 14.9549 14.9612 343767
5 2005-01-03 14:00:00 14.9674 15.0616 14.9674 15.0051 364739
where the dataframe's datetime column is now the ideal timeseries, and missing points are NaN
I've tried to study the documentation on this but I'm still a noob and I can't figure this out. Any suggestions?
You can use the merge function in pandas. Merge the two dataframes on datetime and pass 'outer' to the how parameter; outer uses the union of keys from both dataframes.
sample code:
import pandas as pd
# First DataFrame
df1 = pd.DataFrame({'datetime': ['2005-01-03 10:00:00', '2005-01-03 11:00:00', '2005-01-03 12:00:00',
                                 '2005-01-03 13:00:00', '2005-01-03 14:00:00']})
df2 = pd.DataFrame({'datetime': ['2005-01-03 10:00:00', '2005-01-03 11:00:00', '2005-01-03 13:00:00', '2005-01-03 14:00:00'],
                    'open': [15.1118, 14.8294, 15.0490, 14.9674],
                    'high': [15.1745, 14.9737, 15.0929, 15.0616],
                    'low': [14.7478, 14.7792, 14.9549, 14.9674],
                    'close': [14.8294, 14.9423, 14.9612, 15.0051],
                    'volume': [586463, 344888, 343767, 364739]})
merged_df = pd.merge(df1, df2, how='outer', on='datetime')
merged_df
output:
              datetime     open     high      low    close    volume
0  2005-01-03 10:00:00  15.1118  15.1745  14.7478  14.8294  586463.0
1  2005-01-03 11:00:00  14.8294  14.9737  14.7792  14.9423  344888.0
2  2005-01-03 12:00:00      NaN      NaN      NaN      NaN       NaN
3  2005-01-03 13:00:00  15.0490  15.0929  14.9549  14.9612  343767.0
4  2005-01-03 14:00:00  14.9674  15.0616  14.9674  15.0051  364739.0
Reference:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
IIUC, it is quite simple. You can prepare the desired index (ix below), then df.reindex(ix) is the result you are looking for (assuming your df has 'datetime' as index --if not, make it so with df = df.set_index('datetime')). If the index is that of another DataFrame, then use that instead of making ix from scratch. If you just want to make sure the index is at hourly frequency and complete, then no need for ix: just df.resample('H').asfreq() is the value you want.
Note: here, there is no need to use pd.merge(). It is overkill for this problem and many times slower in this case.
Example (using your df):
start, end = '2005-01-03 10:00:00', '2005-01-03 14:00:00'
ix = pd.date_range(start, end, freq='H')
>>> df.reindex(ix)
open high low close volume
2005-01-03 10:00:00 15.1118 15.1745 14.7478 14.8294 586463.0
2005-01-03 11:00:00 14.8294 14.9737 14.7792 14.9423 344888.0
2005-01-03 12:00:00 NaN NaN NaN NaN NaN
2005-01-03 13:00:00 15.0490 15.0929 14.9549 14.9612 343767.0
2005-01-03 14:00:00 14.9674 15.0616 14.9674 15.0051 364739.0
With this specific ix, you get the same with simply:
>>> df.resample('H').asfreq()
# same as above
Note: the initial df is the one you use as example, with 'datetime' set as index:
df = pd.DataFrame({
'datetime': pd.to_datetime([
'2005-01-03 10:00:00', '2005-01-03 11:00:00', '2005-01-03 13:00:00', '2005-01-03 14:00:00']),
'open': [15.1118, 14.8294, 15.049, 14.9674],
'high': [15.1745, 14.9737, 15.0929, 15.0616],
'low': [14.7478, 14.7792, 14.9549, 14.9674],
'close': [14.8294, 14.9423, 14.9612, 15.0051],
'volume': [586463, 344888, 343767, 364739],
}).set_index('datetime')
>>> df
open high low close volume
datetime
2005-01-03 10:00:00 15.1118 15.1745 14.7478 14.8294 586463
2005-01-03 11:00:00 14.8294 14.9737 14.7792 14.9423 344888
2005-01-03 13:00:00 15.0490 15.0929 14.9549 14.9612 343767
2005-01-03 14:00:00 14.9674 15.0616 14.9674 15.0051 364739
Speed
Let's see what happens at scale, and compare solutions:
start, end = '2005', '2010'
ix = pd.date_range(start, end, freq='H', inclusive='left')
cols = ['open', 'high', 'low', 'close', 'volume']
np.random.seed(0) # reproducible example
fulldf = pd.DataFrame(np.random.uniform(size=(len(ix), len(cols))), index=ix, columns=cols)
df = fulldf.sample(frac=.9).sort_index().copy(deep=True)
>>> df.shape
(39442, 5)
Now:
t0 = %timeit -o df.reindex(ix)
# 1.57 ms ± 42.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
By contrast:
df1 = pd.DataFrame({'datetime': ix})
df2 = df.rename_axis(index='datetime').reset_index()
t1 = %timeit -o pd.merge(df1, df2, how='outer', on = 'datetime')
# 14.3 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> t1.best / t0.best
9.312707763833396
Conclusion: reindex() is 9x faster than merge() in this example.

Pandas change time values based on condition

I have a dataframe:
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
I would like to convert the time based on conditions: if the hour is less than 9, I want to set it to 9 and if the hour is more than 17, I need to set it to 17.
I tried this approach:
df['time'] = np.where(((df['time'].dt.hour < 9) & (df['time'].dt.hour != 0)), dt.time(9, 00))
I am getting an error: Can only use .dt accessor with datetimelike values.
Can anyone please help me with this? Thanks.
Here's a way to do what your question asks:
df.time = pd.to_datetime(df.time)
# shift any hour below 9 up to hour 9, keeping the minutes and seconds
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
# shift any hour above 17 down to hour 17, keeping the minutes and seconds
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
Input:
time
0 2022-06-06 08:45:00
1 2022-06-06 09:30:00
2 2022-06-06 18:00:00
3 2022-06-06 15:00:00
Output:
time
0 2022-06-06 09:45:00
1 2022-06-06 09:30:00
2 2022-06-06 17:00:00
3 2022-06-06 15:00:00
UPDATE:
Here's alternative code to try to address OP's error as described in the comments:
import pandas as pd
import datetime
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print('', 'df loaded as strings:', df, sep='\n')
df.time = pd.to_datetime(df.time, format='%H:%M:%S')
print('', 'df converted to datetime by pd.to_datetime():', df, sep='\n')
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.time = [time.time() for time in pd.to_datetime(df.time)]
print('', 'df with time column adjusted to have hour between 9 and 17, converted to type "time":', df, sep='\n')
Output:
df loaded as strings:
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
df converted to datetime by pd.to_datetime():
time
0 1900-01-01 08:45:00
1 1900-01-01 09:30:00
2 1900-01-01 18:00:00
3 1900-01-01 15:00:00
df with time column adjusted to have hour between 9 and 17, converted to type "time":
time
0 09:45:00
1 09:30:00
2 17:00:00
3 15:00:00
UPDATE #2:
To not just change the hour for out-of-window times, but to simply apply 9:00 and 17:00 as min and max times, respectively (see OP's comment on this), you can do this:
df.loc[df['time'].dt.hour < 9, 'time'] = pd.to_datetime(pd.DataFrame({
    'year': df['time'].dt.year, 'month': df['time'].dt.month, 'day': df['time'].dt.day,
    'hour': [9]*len(df.index)}))
df.loc[df['time'].dt.hour > 17, 'time'] = pd.to_datetime(pd.DataFrame({
    'year': df['time'].dt.year, 'month': df['time'].dt.month, 'day': df['time'].dt.day,
    'hour': [17]*len(df.index)}))
df['time'] = [time.time() for time in pd.to_datetime(df['time'])]
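A possibly simpler variant (a sketch, not from the original answer): Series.clip accepts aligned Series as bounds, so each timestamp can be clamped to the 09:00 and 17:00 of its own day directly:
# clamp timestamps into each day's 09:00-17:00 window (assumes df['time'] is datetime64)
lower = df['time'].dt.normalize() + pd.Timedelta(hours=9)   # 09:00 of each row's day
upper = df['time'].dt.normalize() + pd.Timedelta(hours=17)  # 17:00 of each row's day
df['time'] = df['time'].clip(lower=lower, upper=upper).dt.time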
Since your 'time' column contains strings, they can be kept as strings and new string values assigned where appropriate. To filter for your criteria it is convenient to: create a datetime Series from the 'time' column, create a boolean Series by comparing the datetime Series with your criteria, then use the boolean Series to filter the rows which need to be changed.
Your data:
import numpy as np
import pandas as pd
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print(df.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
Convert to datetime, make boolean Series with your criteria
dts = pd.to_datetime(df['time'])
lt_nine = dts.dt.hour < 9
gt_seventeen = (dts.dt.hour >= 17)
print(lt_nine)
print(gt_seventeen)
>>>
0 True
1 False
2 False
3 False
Name: time, dtype: bool
0 False
1 False
2 True
3 False
Name: time, dtype: bool
Use the boolean series to assign a new value:
df.loc[lt_nine,'time'] = '09:00:00'
df.loc[gt_seventeen,'time'] = '17:00:00'
print(df.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
Or just stick with strings altogether and create the boolean Series using regex patterns and .str.match.
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00','07:22:00','22:02:06']}
dg = pd.DataFrame(data)
print(dg.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
4 07:22:00
5 22:02:06
# regex patterns (grouped so the ^ anchor applies to every alternative)
pattern_lt_nine = '^(?:00|01|02|03|04|05|06|07|08)'
pattern_gt_seventeen = '^(?:17|18|19|20|21|22|23)'
Make boolean Series and assign new values
gt_seventeen = dg['time'].str.match(pattern_gt_seventeen)
lt_nine = dg['time'].str.match(pattern_lt_nine)
dg.loc[lt_nine,'time'] = '09:00:00'
dg.loc[gt_seventeen,'time'] = '17:00:00'
print(dg.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
4 09:00:00
5 17:00:00
References: pandas docs on "Time series / date functionality" and "Working with text data".

Extend datetimeindex to previous times in pandas

MRE:
idx = pd.date_range('2015-07-03 08:00:00', periods=30, freq='H')
data = np.random.randint(1, 100, size=len(idx))
df = pd.DataFrame({'index':idx, 'col':data})
df.set_index("index", inplace=True)
which looks like:
col
index
2015-07-03 08:00:00 96
2015-07-03 09:00:00 79
2015-07-03 10:00:00 15
2015-07-03 11:00:00 2
2015-07-03 12:00:00 84
2015-07-03 13:00:00 86
2015-07-03 14:00:00 5
.
.
.
Note that the dataframe contains multiple days. Since the frequency is hourly, starting from 07/03 08:00:00 it will contain hourly data.
I want to get all data from 05:00:00 onward on 07/03, even if the 'col' column holds 0 for the new rows.
I want to extend the index backwards so it starts from 05:00:00.
No, I can't just start from 05:00:00, since I already have a dataframe that starts from 08:00:00. I am trying to keep everything the same but add 3 rows at the beginning to include 05:00:00, 06:00:00, and 07:00:00.
The reindex method is handy for changing the index values:
idx = pd.date_range('2015-07-03 08:00:00', periods=30, freq='H')
data = np.random.randint(1, 100, size=len(idx))
# use the index param to set index or you might lose the freq
df = pd.DataFrame({'col':data}, index=idx)
# reindex with a new index
start = df.index[0] - pd.Timedelta(hours=3)  # df.tshift(-3).index[0] also worked historically, but tshift is deprecated
end = df.index[-1]
new_index = pd.date_range(start, end, freq='H')
new_df = df.reindex(new_index)
resample is also very useful for date indices
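For example (a sketch), df.resample('H').asfreq() regularizes the index to hourly frequency and exposes any internal gaps as NaN rows, though it will not extend the range before the first timestamp; reindex, as above, is what extends the range:
# fill any missing hours between the first and last timestamps with NaN
regular = df.resample('H').asfreq()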
Alternatively, create 3 more rows starting from 05:00:00 and prepend them to the existing dataframe:
idx1 = pd.date_range('2015-07-03 05:00:00', periods=3, freq='H')
df1 = pd.DataFrame({'index': idx1, 'col': np.random.randint(1, 100, size=3)})
df1.set_index('index', inplace=True)
df = pd.concat([df1, df])  # DataFrame.append was removed in pandas 2.0
print(df)
Add this snippet to your code...

Pandas read_csv with different date parsers

I have a CSV file with time series data; the first column is the date in the format %Y:%m:%d and the second column is the intraday time in the format %H:%M:%S. I would like to import this CSV file into a multiindex dataframe or panel object.
With this code, it already works:
_file_data = pd.read_csv(_file,
                         sep=",",
                         header=0,
                         index_col=['Date', 'Time'],
                         thousands="'",
                         parse_dates=True,
                         skipinitialspace=True
                         )
It returns the data in the following format:
Date Time Volume
2016-01-04 2018-04-25 09:01:29 53645
2018-04-25 10:01:29 123
2018-04-25 10:01:29 1345
....
2016-01-05 2018-04-25 10:01:29 123
2018-04-25 12:01:29 213
2018-04-25 10:01:29 123
1st question:
I would like to show the second index as a pure time object, not datetime. To do that, I would have to declare two different date parsers in the read_csv function, but I can't figure out how. What is the "best" way to do that?
2nd question:
After I created the Dataframe, I converted it to a panel-object. Would you recommend doing that? Is the panel-object the better choice for such a data structure? What are the benefits (drawbacks) of a panel-object?
1st question:
You can create multiple converters and pass them to read_csv in a dictionary:
import pandas as pd
temp=u"""Date,Time,Volume
2016:01:04,09:00:00,53645
2016:01:04,09:20:00,0
2016:01:04,09:40:00,0
2016:01:04,10:00:00,1468
2016:01:05,10:00:00,246
2016:01:05,10:20:00,0
2016:01:05,10:40:00,0
2016:01:05,11:00:00,0
2016:01:05,11:20:00,0
2016:01:05,11:40:00,0
2016:01:05,12:00:00,213"""
def converter1(x):
    # convert to datetime and then to time
    return pd.to_datetime(x).time()

def converter2(x):
    # parse dates in the %Y:%m:%d format
    return pd.to_datetime(x, format='%Y:%m:%d')

# after testing, replace 'StringIO(temp)' with 'filename.csv'
# (pd.compat.StringIO was removed in newer pandas; use io.StringIO)
from io import StringIO
df = pd.read_csv(StringIO(temp),
                 index_col=['Date', 'Time'],
                 thousands="'",
                 skipinitialspace=True,
                 converters={'Time': converter1, 'Date': converter2})
print (df)
Volume
Date Time
2016-01-04 09:00:00 53645
09:20:00 0
09:40:00 0
10:00:00 1468
2016-01-05 10:00:00 246
10:20:00 0
10:40:00 0
11:00:00 0
11:20:00 0
11:40:00 0
12:00:00 213
Sometimes it is possible to use the built-in parser, e.g. if the format of the dates is YYYY-MM-DD:
import pandas as pd
temp=u"""Date,Time,Volume
2016-01-04,09:00:00,53645
2016-01-04,09:20:00,0
2016-01-04,09:40:00,0
2016-01-04,10:00:00,1468
2016-01-05,10:00:00,246
2016-01-05,10:20:00,0
2016-01-05,10:40:00,0
2016-01-05,11:00:00,0
2016-01-05,11:20:00,0
2016-01-05,11:40:00,0
2016-01-05,12:00:00,213"""
def converter(x):
    # convert to datetime and then to time
    return pd.to_datetime(x).time()

# after testing, replace 'StringIO(temp)' with 'filename.csv'
from io import StringIO
df = pd.read_csv(StringIO(temp),
                 index_col=['Date', 'Time'],
                 parse_dates=['Date'],
                 thousands="'",
                 skipinitialspace=True,
                 converters={'Time': converter})
print (df.index.get_level_values(0))
DatetimeIndex(['2016-01-04', '2016-01-04', '2016-01-04', '2016-01-04',
'2016-01-05', '2016-01-05', '2016-01-05', '2016-01-05',
'2016-01-05', '2016-01-05', '2016-01-05'],
dtype='datetime64[ns]', name='Date', freq=None)
A last possible solution is to convert the datetimes to times in the MultiIndex with set_levels, after processing:
df.index = df.index.set_levels(df.index.get_level_values(1).time, level=1)
print (df)
Volume
Date Time
2016-01-04 09:00:00 53645
09:20:00 0
09:40:00 0
10:00:00 1468
2016-01-05 10:00:00      246
           10:20:00        0
           10:40:00        0
           11:00:00        0
           11:20:00        0
           11:40:00        0
           12:00:00      213
2nd question:
Panel is deprecated since pandas 0.20 and was removed in pandas 0.25; use a MultiIndex DataFrame instead.
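A minimal sketch of the recommended replacement: keep the extra dimension as a level of a MultiIndex DataFrame instead of a Panel (the names below are hypothetical):
import pandas as pd
import numpy as np

# 'item' x 'date' data that would previously have gone into a Panel
idx = pd.MultiIndex.from_product(
    [['itemA', 'itemB'], pd.date_range('2016-01-04', periods=3)],
    names=['item', 'date'])
df3d = pd.DataFrame(np.random.rand(6, 2), index=idx, columns=['open', 'close'])
# select one item the way you would have picked a Panel item
print(df3d.loc['itemA'])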
To convert the times to timedeltas, use pd.to_timedelta.
Ex:
import pandas as pd
df = pd.DataFrame({"Time": ["2018-04-25 09:01:29", "2018-04-25 10:01:29", "2018-04-25 10:01:29"]})
df["Time"] = pd.to_timedelta(pd.to_datetime(df["Time"]).dt.strftime('%H:%M:%S'))
print df["Time"]
Output:
0 09:01:29
1 10:01:29
2 10:01:29
Name: Time, dtype: timedelta64[ns]

Find difference between dates pandas using loc function

I have this dataframe
open high low close volume
TimeStamp
2017-12-22 13:15:00 12935.00 13200.00 12508.71 12514.91 244.728611
2017-12-22 13:30:00 12514.91 12999.99 12508.71 12666.34 150.457869
2017-12-22 13:45:00 12666.33 12899.97 12094.00 12094.00 198.680014
2017-12-22 14:00:00 12094.01 12354.99 11150.00 11150.00 256.812634
2017-12-22 14:15:00 11150.01 12510.00 10400.00 12276.33 262.217127
I want to know if every row is exactly 15 minutes apart in time.
So I built a new column with a shift of the timestamps:
open high low close volume \
TimeStamp
2017-12-20 13:30:00 17503.98 17600.00 17100.57 17119.89 312.773644
2017-12-20 13:45:00 17119.89 17372.98 17049.00 17170.00 322.953671
2017-12-20 14:00:00 17170.00 17573.00 17170.00 17395.74 236.085829
2017-12-20 14:15:00 17395.74 17398.00 17200.01 17280.00 220.467382
2017-12-20 14:30:00 17280.00 17313.94 17150.00 17256.05 222.760598
new_time
TimeStamp
2017-12-20 13:30:00 2017-12-20 13:45:00
2017-12-20 13:45:00 2017-12-20 14:00:00
2017-12-20 14:00:00 2017-12-20 14:15:00
2017-12-20 14:15:00 2017-12-20 14:30:00
2017-12-20 14:30:00 2017-12-20 14:45:00
Now I want to locate every row that doesn't respect the 15-minute difference rule, so I did:
dfh.loc[(dfh['new_time'].to_pydatetime()-dfh.index.to_pydatetime())>datetime.timedelta(0, 900)]
I get this error,
Traceback (most recent call last):
File "<pyshell#252>", line 1, in <module>
dfh.loc[(dfh['new_time'].to_pydatetime()-dfh.index.to_pydatetime())>datetime.timedelta(0, 900)]
File "C:\Users\Araujo\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 3614, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'to_pydatetime'
Is there any way to do this?
EDIT:
Shift only works with periodic data; is there any way to do this with non-periodic data?
This would work:
import pandas as pd
import numpy as np
import datetime as dt
data = [
['2017-12-22 13:15:00', 12935.00, 13200.00, 12508.71, 12514.91, 244.728611],
['2017-12-22 13:30:00', 12514.91, 12999.99, 12508.71, 12666.34, 150.457869],
['2017-12-22 13:45:00', 12666.33, 12899.97, 12094.00, 12094.00, 198.680014],
['2017-12-22 14:00:00', 12094.01, 12354.99, 11150.00, 11150.00, 256.812634],
['2017-12-22 14:15:00', 11150.01, 12510.00, 10400.00, 12276.33, 262.217127]
]
df = pd.DataFrame(data, columns = ['Timestamp', 'open', 'high', 'low', 'close', 'volume'])
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['plus_15'] = df['Timestamp'].shift(1) + dt.timedelta(minutes = 15)
df['valid_time'] = np.where((df['Timestamp'] == df['plus_15']) | (df.index == 0), 1, 0)
print(df[['Timestamp', 'valid_time']])
#output
Timestamp valid_time
0 2017-12-22 13:15:00 1
1 2017-12-22 13:30:00 1
2 2017-12-22 13:45:00 1
3 2017-12-22 14:00:00 1
4 2017-12-22 14:15:00 1
So create a new column, plus_15, that adds 15 minutes to the previous row's timestamp. Then create another column, valid_time, which compares the Timestamp column to the plus_15 column and marks 1 when they are equal and 0 when they are not.
Can we do something like this?
import pandas as pd
import numpy as np
data = '''\
TimeStamp open high low close volume
2017-12-22T13:15:00 12935.00 13200.00 12508.71 12514.91 244.728611
2017-12-22T13:30:00 12514.91 12999.99 12508.71 12666.34 150.457869
2017-12-22T13:45:00 12666.33 12899.97 12094.00 12094.00 198.680014
2017-12-22T14:00:00 12094.01 12354.99 11150.00 11150.00 256.812634
2017-12-22T14:15:00 11150.01 12510.00 10400.00 12276.33 262.217127'''
from io import StringIO  # pd.compat.StringIO was removed in newer pandas
df = pd.read_csv(StringIO(data),
                 sep=r'\s+', parse_dates=['TimeStamp'], index_col=['TimeStamp'])
df['new_time'] = df.index[1:].tolist() + [np.nan]  # np.NaN was removed in NumPy 2.0
# df['new_time'] = np.roll(df.index, -1) # if last is not first+15min
# use boolean indexing to filter away unwanted rows
df[[(dt2-dt1)/np.timedelta64(1, 's') == 900
for dt1,dt2 in zip(df.index.values,df.new_time.values)]]
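A more idiomatic alternative (a sketch, not from the original answers, assuming df has the TimeStamp DatetimeIndex from above): compute the gap between consecutive index values with diff and flag anything that is not exactly 15 minutes:
import pandas as pd

# gap between each timestamp and its predecessor; the first entry is NaT
gaps = df.index.to_series().diff()
# rows that do NOT follow their predecessor by exactly 15 minutes
bad_rows = df[gaps != pd.Timedelta(minutes=15)].iloc[1:]  # drop the NaT first row
print(bad_rows)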
