How to iterate through a time frame? - python

Okay, so I have some S&P 500 minute data from a csv file. I am looking to iterate through the timestamps based on their time of day. So far the code looks like this:
import datetime as dt
import numpy as np
import pandas as pd

d = pd.read_csv('/Volumes/Seagate Portable/usindex_2020_all_tickers_awvbxk9/SPX_2020_2020.txt')
d.columns = ['Dates', 'Open', 'High', 'Low', 'Close']
d.Dates = pd.to_datetime(d.Dates)
d = d[(d.Dates.dt.time == dt.time(9, 30)) | (d.Dates.dt.time == dt.time(16, 0))].copy()
d.drop(['High', 'Low'], axis=1, inplace=True)
d.index = range(len(d.Open))
for i in d.index:
    if dt.time(16, 0) in d.Dates[i]:
        d['Open'][i] == np.NaN
The DataFrame at this point looks like this:
Date Open Close
0 2020-01-02 16:00:00 3258.14 3257.98
1 2020-01-03 09:30:00 3226.36 3225.79
2 2020-01-03 16:00:00 3234.35 3234.57
3 2020-01-06 09:30:00 3217.55 3215.01
4 2020-01-06 16:00:00 3246.23 3246.28
5 2020-01-07 09:30:00 3241.86 3238.09
6 2020-01-07 16:00:00 3237.13 3237.18
7 2020-01-08 09:30:00 3238.59 3236.82
8 2020-01-08 16:00:00 3253.21 3253.06
9 2020-01-09 09:30:00 3266.03 3270.29
10 2020-01-09 16:00:00 3274.74 3274.66
11 2020-01-10 09:30:00 3281.81 3281.20
12 2020-01-10 16:00:00 3265.39 3265.34
13 2020-01-13 09:30:00 3271.13 3273.28
14 2020-01-13 16:00:00 3287.98 3288.05
15 2020-01-14 09:30:00 3285.35 3285.09
16 2020-01-14 16:00:00 3282.93 3282.89
17 2020-01-15 09:30:00 3282.27 3281.75
18 2020-01-15 16:00:00 3289.76 3289.40
19 2020-01-16 09:30:00 3302.97 3304.34
The error I am getting is TypeError: argument of type 'Timestamp' is not iterable.
What I am trying to do is set all Open values at 16:00:00 to NaN while keeping the Close values for that time. Can I iterate through the timestamps with the same for loop, or is there another way to go through the rows and fill in the respective NaN values? Thanks!

The in operator tests for membership in a collection or for a substring within a string. You cannot use it to test for a time inside a Timestamp.
If you want to use a for loop:
for i in d.index:
    if d.loc[i, 'Dates'].time() == dt.time(16, 0):
        d.loc[i, 'Open'] = np.nan
But it's always better to use a vectorized function:
d['Open'] = d['Open'].mask(d['Dates'].dt.time == dt.time(16, 0))
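Equivalently, boolean indexing with .loc does the assignment in one line; a sketch using the same column names as above:
import datetime as dt
import numpy as np

# set Open to NaN on every 16:00:00 row, leaving Close untouched
d.loc[d['Dates'].dt.time == dt.time(16, 0), 'Open'] = np.nan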

for i in d.index:
    if d.Dates[i].time() == dt.time(16, 0):
        d.loc[i, 'Open'] = np.nan
Note that comparing with is instead of == will never match here: is tests object identity, not equality. Likewise, dt.time(16, 0) == d.Dates[i] is always False, because it compares a time against a full Timestamp; call .time() on the Timestamp first.

Related

Measure difference between timestamps using conditions - python

I'm trying to measure the difference between timestamps using certain conditions. Using below, for each unique ID, I'm hoping to subtract the End Time where Item == A and the Start Time where Item == D.
So the timestamps are actually located on separate rows.
At the moment my process is returning an error. I'm also hoping to drop the .shift() for something more robust, as each unique ID will have different combinations, e.g. A,B,C,D; A,B,D; A,D; etc.
df = pd.DataFrame({'ID': [10, 10, 10, 20, 20, 30],
                   'Start Time': ['2019-08-02 09:00:00', '2019-08-03 10:50:00', '2019-08-05 16:00:00',
                                  '2019-08-04 08:00:00', '2019-08-04 15:30:00', '2019-08-06 11:00:00'],
                   'End Time': ['2019-08-04 15:00:00', '2019-08-04 16:00:00', '2019-08-05 16:00:00',
                                '2019-08-04 14:00:00', '2019-08-05 20:30:00', '2019-08-07 10:00:00'],
                   })
df['Item'] = ['A', 'B', 'D', 'A', 'D', 'A']
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])
df['diff'] = (df.groupby('ID')
                .apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
                .reset_index(drop=True))
Intended Output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-04 15:00:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-04 16:00:00 B NaT
2 10 2019-08-05 16:00:00 2019-08-05 16:00:00 D 1 days 01:00:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 A NaT
4 20 2019-08-04 15:30:00 2019-08-05 20:30:00 D 0 days 01:30:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
df2 = df.set_index('ID')
df2.query('Item == "D"')['Start Time'] - df2.query('Item == "A"')['End Time']
output:
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]
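If you want the differences back in the original frame, only on the D rows as in the intended output, one way is to map the per-ID values back; a sketch (the diff_per_id name is mine):
diff_per_id = (df2.query('Item == "D"')['Start Time']
               - df2.query('Item == "A"')['End Time'])
# place each ID's difference on its D row; every other row stays NaT
df['diff'] = df['ID'].map(diff_per_id).where(df['Item'].eq('D'))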
older answer
The issue is your fillna: you can't put strings in a timedelta column:
df['diff'] = (df.groupby('ID')
                .apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
                # .fillna('-')  # the issue is here
                .reset_index(drop=True))
output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-02 09:30:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-03 11:00:00 B 0 days 00:30:00
2 10 2019-08-04 15:00:00 2019-08-05 16:00:00 C 0 days 00:10:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 B NaT
4 20 2019-08-05 10:30:00 2019-08-05 20:30:00 C 0 days 06:00:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
IIUC use:
df1 = df.pivot(index='ID', columns='Item')
print(df1)
Start Time \
Item A B D
ID
10 2019-08-02 09:00:00 2019-08-03 10:50:00 2019-08-04 15:00:00
20 2019-08-04 08:00:00 NaT 2019-08-05 10:30:00
30 2019-08-06 11:00:00 NaT NaT
End Time
Item A B D
ID
10 2019-08-02 09:30:00 2019-08-03 11:00:00 2019-08-05 16:00:00
20 2019-08-04 14:00:00 NaT 2019-08-05 20:30:00
30 2019-08-07 10:00:00 NaT NaT
a = df1[('Start Time', 'D')].sub(df1[('End Time', 'A')])
print(a)
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]

Filter rows in DataFrame where certain conditions are met?

I have a DataFrame with relevant stock information that looks like this.
[Screenshot of my dataframe]
I need it so that if the 'close' from one row is different from the 'open' in the next row, a new dataframe will be created storing the rows that fulfill this criterion. I would like all of the values from the row to be saved in the new dataframe. To clarify, I would like both rows where this happens to be stored in the new dataframe.
DataFrame as text as requested:
timestamp open high low close volume
0 2020-01-01 00:00:00 129.16 130.98 128.68 130.24 4.714333e+04
1 2020-01-01 08:00:00 130.24 132.40 129.87 132.08 5.183323e+04
2 2020-01-01 16:00:00 132.08 133.05 129.74 130.77 4.579396e+04
3 2020-01-02 00:00:00 130.72 130.78 128.69 129.26 6.606601e+04
4 2020-01-02 08:00:00 129.23 130.28 128.90 129.59 4.849893e+04
5 2020-01-02 16:00:00 129.58 129.78 126.38 127.19 9.919212e+04
6 2020-01-03 00:00:00 127.19 130.15 125.88 128.86 1.276414e+05
This can be accomplished using Series.shift:
>>> df['close'] != df['open'].shift(-1)
0    False
1    False
2     True
3     True
4     True
5    False
6     True
dtype: bool
This compares the close value in one row to the open value of the next row ("shifted" one row ahead).
You can then select the rows for which the condition is True.
>>> df[df['close'] != df['open'].shift(-1)]
timestamp open high low close volume
2 2020-01-01 16:00:00 132.08 133.05 129.74 130.77 45793.96
3 2020-01-02 00:00:00 130.72 130.78 128.69 129.26 66066.01
4 2020-01-02 08:00:00 129.23 130.28 128.90 129.59 48498.93
6 2020-01-03 00:00:00 127.19 130.15 125.88 128.86 127641.40
This condition is True on the first of the two rows (the one whose close mismatches the following open); to also keep the row after it, shift the condition down one and take the union of the two masks.
>>> row_condition = df['close'] != df['open'].shift(-1)
>>> row_after = row_condition.shift(1, fill_value=False)
>>> df[row_condition | row_after]
             timestamp    open    high     low   close     volume
2  2020-01-01 16:00:00  132.08  133.05  129.74  130.77   45793.96
3  2020-01-02 00:00:00  130.72  130.78  128.69  129.26   66066.01
4  2020-01-02 08:00:00  129.23  130.28  128.90  129.59   48498.93
5  2020-01-02 16:00:00  129.58  129.78  126.38  127.19   99192.12
6  2020-01-03 00:00:00  127.19  130.15  125.88  128.86  127641.40
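One caveat: the last row always satisfies the condition, because its close is compared against NaN (there is no next row). If such a row should not count as a mismatch, clear it from the mask first; a sketch:
mismatch = df['close'].ne(df['open'].shift(-1))
mismatch.iloc[-1] = False  # the final row has no successor to compare against
pairs = df[mismatch | mismatch.shift(1, fill_value=False)]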
Providing a textual sample of the DataFrame is useful because this can be copied directly into a Python session; I would have had to manually type the content of your screenshot otherwise.

Most effective method to get the max value from a column based on a timedelta calculated from the current row

I would like to identify the maximum value in a column that occurs within the following X days from the current date.
This is a subselect of the data frame showing the daily values for 2020.
Date Data
6780 2020-01-02 323.540009
6781 2020-01-03 321.160004
6782 2020-01-06 320.489990
6783 2020-01-07 323.019989
6784 2020-01-08 322.940002
... ... ...
7028 2020-12-24 368.079987
7029 2020-12-28 371.739990
7030 2020-12-29 373.809998
7031 2020-12-30 372.339996
I would like to find a way to identify the max value within the following 30 days. e.g.
Date Data Max
6780 2020-01-02 323.540009 323.019989
6781 2020-01-03 321.160004 323.019989
6782 2020-01-06 320.489990 323.730011
6783 2020-01-07 323.019989 323.540009
6784 2020-01-08 322.940002 325.779999
... ... ... ...
7028 2020-12-24 368.079987 373.809998
7029 2020-12-28 371.739990 373.809998
7030 2020-12-29 373.809998 372.339996
7031 2020-12-30 372.339996 373.100006
I tried calculating the start and end dates and storing them in the columns. e.g.
df['startDate'] = df['Date'] + pd.to_timedelta(1, unit='d')
df['endDate'] = df['Date'] + pd.to_timedelta(30, unit='d')
before trying to calculate the max. e.g,
df['Max'] = df.loc[(df['Date'] > df['startDate']) & (df['Date'] < df['endDate'])]['Data'].max()
But this results in;
Date Data startDate endDate Max
6780 2020-01-02 323.540009 2020-01-03 2020-01-29 NaN
6781 2020-01-03 321.160004 2020-01-04 2020-01-30 NaN
6782 2020-01-06 320.489990 2020-01-07 2020-02-02 NaN
6783 2020-01-07 323.019989 2020-01-08 2020-02-03 NaN
6784 2020-01-08 322.940002 2020-01-09 2020-02-04 NaN
... ... ... ... ... ...
7027 2020-12-23 368.279999 2020-12-24 2021-01-19 NaN
7028 2020-12-24 368.079987 2020-12-25 2021-01-20 NaN
7029 2020-12-28 371.739990 2020-12-29 2021-01-24 NaN
7030 2020-12-29 373.809998 2020-12-31 2021-01-26 NaN
If I statically add dates to the loc[] statement it partially works, filling the max for that static range, but it gives me the same value for every row.
Any help on the correct pandas way to achieve this would be appreciated.
Kind regards
df.rolling can do this if you convert the date to a datetime and use it as the index (a 2-day window is shown below; see the note after the output for making it forward-looking over 30 days):
df["Date"] = pd.to_datetime(df.Date)
df.set_index("Date").rolling("2d").max()
output:
Data
Date
2020-01-02 323.540009
2020-01-03 323.540009
2020-01-06 320.489990
2020-01-07 323.019989
2020-01-08 323.019989
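Note that a time-based rolling window looks backward, over the current row and the window preceding it, while the question asks for the max over the following 30 days. If a window of the next 30 rows (trading days rather than calendar days) is acceptable, pandas' FixedForwardWindowIndexer handles the forward direction; a sketch using the column names from the question:
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

df["Date"] = pd.to_datetime(df["Date"])
# max over the current row and the 29 rows that follow it
indexer = FixedForwardWindowIndexer(window_size=30)
df["Max"] = df["Data"].rolling(indexer, min_periods=1).max()
To exclude the current row itself from the window, roll over df["Data"].shift(-1) instead.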

`set_index()` does not sort index?

Concatenating this dask DataFrame to this pandas DataFrame and using set_index to sort index does not result in a sorted index. Is this normal?
from dask import dataframe as dd
import pandas as pd

a = list('aabbccddeeffgghhi')
df = pd.DataFrame(dict(a=a),
                  index=pd.date_range(start='2010/01/01', end='2010/02/01',
                                      periods=len(a))).reset_index()
ddf = dd.from_pandas(df, npartitions=5)

a2 = list('aabbccddeef')
df2 = pd.DataFrame(dict(a=a2),
                   index=pd.date_range(start='2020/01/01', end='2020/01/06',
                                       periods=len(a2))).reset_index()
ddf2 = dd.concat([ddf, df2]).set_index('index')
ddf2.compute()
a
index
2010-01-01 00:00:00 a
2010-01-02 22:30:00 a
2010-01-04 21:00:00 b
2010-01-06 19:30:00 b
2010-01-08 18:00:00 c
2010-01-10 16:30:00 c
2010-01-12 15:00:00 d
2010-01-14 13:30:00 d
2010-01-16 12:00:00 e
2010-01-18 10:30:00 e
2010-01-20 09:00:00 f
2010-01-22 07:30:00 f
2010-01-24 06:00:00 g
2010-01-26 04:30:00 g
2010-01-28 03:00:00 h
2010-01-30 01:30:00 h
2010-02-01 00:00:00 i
2020-01-01 00:00:00 a
2020-01-01 12:00:00 a
2020-01-02 00:00:00 b
2020-01-02 12:00:00 b
2020-01-03 00:00:00 c
2020-01-03 12:00:00 c
2020-01-04 00:00:00 d
2020-01-04 12:00:00 d
2020-01-05 00:00:00 e
2020-01-05 12:00:00 e
2020-01-06 00:00:00 f
Am I doing something the wrong way?
Yes, it is completely normal: most pandas operations do not assume a sorted index, though some do.
In dask dataframes you must apply
ddf2 = dd.concat([ddf, df2]).set_index('index', sorted=True)
By the way, your data is already properly sorted by index; note the years (2010, then 2020).
Try sort_index to sort; set_index just sets the index and doesn't force a re-sort.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html
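A minimal pandas sketch of that distinction:
import pandas as pd

df = pd.DataFrame({'key': [2, 0, 1], 'a': ['x', 'y', 'z']})
by_key = df.set_index('key')         # rows keep their original order: 2, 0, 1
by_key_sorted = by_key.sort_index()  # rows reordered by index: 0, 1, 2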

How to properly pivot or reshape a timeseries dataframe in Pandas?

I need to reshape a dataframe that looks like df1 and turn it into df2. There are 2 considerations for this procedure:
I need to be able to set the number of rows to be sliced as a parameter (length).
I need to split date and time from the index, and use date in the reshape as the column names and keep time as the index.
Current df1
2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10
Desired Output df2 - With the parameter 'length=5'
2007-08-07 2007-11-02
18:00:00 1 6
00:00:00 2 7
06:00:00 3 8
12:00:00 4 9
18:00:00 5 10
What have I done:
My approach was to create a multi-index (Date - Time) and then do a pivot table or some sort of reshape to achieve the desired df output.
import numpy as np
import pandas as pd

'''
First separate time and date.
'''
df['TimeStamp'] = df.index
df['date'] = df.index.date
df['time'] = df.index.time

'''
Then create a way to separate the slices and make those specific dates
available to then create a multi-index.
'''
for index, row in df.iterrows():
    df['Num'] = np.arange(len(df))
for index, row in df.iterrows():
    if row['Num'] % 5 == 0:
        df.loc[index, 'EventDate'] = df.loc[index, 'Date']
df.set_index(['EventDate', 'Hour'], inplace=True)
del df['Date']
del df['Num']
del df['TimeStamp']
Problem: a NaN appears next to each date in the first level of the multi-index. And even if that worked well, I can't find how to do what I need with a multi-index df.
I'm stuck. I appreciate any input.
import numpy as np
import pandas as pd
import io

data = '''\
val
2007-08-07 18:00:00    1
2007-08-08 00:00:00    2
2007-08-08 06:00:00    3
2007-08-08 12:00:00    4
2007-08-08 18:00:00    5
2007-11-02 18:00:00    6
2007-11-03 00:00:00    7
2007-11-03 06:00:00    8
2007-11-03 12:00:00    9
2007-11-03 18:00:00   10'''
df = pd.read_table(io.StringIO(data), sep=r'\s{2,}', engine='python',
                   parse_dates=True)

chunksize = 5
chunks = len(df) // chunksize
df['Date'] = np.repeat(df.index.date[::chunksize], chunksize)[:len(df)]
index = df.index.time[:chunksize]
df['Time'] = np.tile(np.arange(chunksize), chunks)
df = df.set_index(['Date', 'Time'], append=False)
df = df['val'].unstack('Date')
df.index = index
print(df)
yields
Date 2007-08-07 2007-11-02
18:00:00 1 6
00:00:00 2 7
06:00:00 3 8
12:00:00 4 9
18:00:00 5 10
Note that the final DataFrame has an index with non-unique entries. (The
18:00:00 is repeated.) Some DataFrame operations are problematic when the
index has repeated entries, so in general it is better to avoid this if
possible.
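If the block size needs to be a parameter, the approach above generalizes to a small helper. A sketch (the function name is mine; it assumes a DatetimeIndex and a row count that is a multiple of length):
import pandas as pd

def reshape_blocks(s, length):
    # slice the series into consecutive blocks of `length` rows;
    # each block's first date becomes a column and the first block's
    # times become the shared index
    blocks = len(s) // length
    return pd.DataFrame(
        s.to_numpy()[:blocks * length].reshape(blocks, length).T,
        index=s.index.time[:length],
        columns=s.index.date[::length][:blocks],
    )
With the sample data, reshape_blocks(df['val'], 5) reproduces the desired df2.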
First of all, I'm assuming your datetime column is actually a datetime type; if not, use df['t'] = pd.to_datetime(df['t']) to convert.
Then set your index using a MultiIndex and unstack:
df.index = pd.MultiIndex.from_tuples(df['t'].apply(lambda x: (x.time(), x.date())))
df['v'].unstack()
This would be a canonical approach for pandas:
First, setup with imports and data:
import pandas as pd
from io import StringIO
txt = '''2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10'''
Now read in the DataFrame, and pivot on the correct columns:
df1 = pd.read_csv(StringIO(txt), sep=' ',
                  names=['d', 't', 'n'])
print(df1.pivot(index='t', columns='d', values='n'))
prints a pivoted df:
d 2007-08-07 2007-08-08 2007-11-02 2007-11-03
t
00:00:00 NaN 2 NaN 7
06:00:00 NaN 3 NaN 8
12:00:00 NaN 4 NaN 9
18:00:00 1 5 6 10
You won't get a length of 5, though. The following,
2007-08-07 2007-11-02
18:00:00 1 6
00:00:00 2 7
06:00:00 3 8
12:00:00 4 9
18:00:00 5 10
is incorrect, as 18:00:00 appears twice in the same column even though, in your initial data, those rows apply to different dates.
