I have a df that looks like this (shortened):
DateTime Value Date Time
0 2022-09-18 06:00:00 5.4 18/09/2022 06:00
1 2022-09-18 07:00:00 6.0 18/09/2022 07:00
2 2022-09-18 08:00:00 6.5 18/09/2022 08:00
3 2022-09-18 09:00:00 6.7 18/09/2022 09:00
8 2022-09-18 14:00:00 7.9 18/09/2022 14:00
9 2022-09-18 15:00:00 7.8 18/09/2022 15:00
10 2022-09-18 16:00:00 7.6 18/09/2022 16:00
11 2022-09-18 17:00:00 6.8 18/09/2022 17:00
12 2022-09-18 18:00:00 6.4 18/09/2022 18:00
13 2022-09-18 19:00:00 5.7 18/09/2022 19:00
14 2022-09-18 20:00:00 4.8 18/09/2022 20:00
15 2022-09-18 21:00:00 5.4 18/09/2022 21:00
16 2022-09-18 22:00:00 4.7 18/09/2022 22:00
17 2022-09-18 23:00:00 4.3 18/09/2022 23:00
18 2022-09-19 00:00:00 4.1 19/09/2022 00:00
19 2022-09-19 01:00:00 4.4 19/09/2022 01:00
22 2022-09-19 04:00:00 3.5 19/09/2022 04:00
23 2022-09-19 05:00:00 2.8 19/09/2022 05:00
24 2022-09-19 06:00:00 3.8 19/09/2022 06:00
I want to create a new column where I split the hours between day and night like this:
00:00 - 05:00 night,
06:00 - 18:00 day,
19:00 - 23:00 night
But apparently one can't use the same label twice? How can I solve this problem? Here is my code:
df['period'] = pd.cut(pd.to_datetime(df.DateTime).dt.hour,
                      bins=[0, 5, 17, 23],
                      labels=['night', 'morning', 'night'],
                      include_lowest=True)
It's returning
ValueError: labels must be unique if ordered=True; pass ordered=False for duplicate labels
If I understood correctly: if the time is between 00:00 - 05:00 or 19:00 - 23:00, you want your new column to say 'night', else 'day'. Well, here's that code:
df['day/night'] = df['Time'].apply(lambda x: 'night' if '00:00' <= x <= '05:00' or '19:00' <= x <= '23:00' else 'day')
Or you can keep your own pd.cut approach and add the ordered=False parameter.
input ->
df = pd.DataFrame(columns=['DateTime', 'Value', 'Date', 'Time'], data=[
['2022-09-18 06:00:00', 5.4, '18/09/2022', '06:00'],
['2022-09-18 07:00:00', 6.0, '18/09/2022', '07:00'],
['2022-09-18 08:00:00', 6.5, '18/09/2022', '08:00'],
['2022-09-18 09:00:00', 6.7, '18/09/2022', '09:00'],
['2022-09-18 14:00:00', 7.9, '18/09/2022', '14:00'],
['2022-09-18 15:00:00', 7.8, '18/09/2022', '15:00'],
['2022-09-18 16:00:00', 7.6, '18/09/2022', '16:00'],
['2022-09-18 17:00:00', 6.8, '18/09/2022', '17:00'],
['2022-09-18 18:00:00', 6.4, '18/09/2022', '18:00'],
['2022-09-18 19:00:00', 5.7, '18/09/2022', '19:00'],
['2022-09-18 20:00:00', 4.8, '18/09/2022', '20:00'],
['2022-09-18 21:00:00', 5.4, '18/09/2022', '21:00'],
['2022-09-18 22:00:00', 4.7, '18/09/2022', '22:00'],
['2022-09-18 23:00:00', 4.3, '18/09/2022', '23:00'],
['2022-09-19 00:00:00', 4.1, '19/09/2022', '00:00'],
['2022-09-19 01:00:00', 4.4, '19/09/2022', '01:00'],
['2022-09-19 04:00:00', 3.5, '19/09/2022', '04:00'],
['2022-09-19 05:00:00', 2.8, '19/09/2022', '05:00'],
['2022-09-19 06:00:00', 3.8, '19/09/2022', '06:00']])
output ->
DateTime Value Date Time day/night
0 2022-09-18 06:00:00 5.4 18/09/2022 06:00 day
1 2022-09-18 07:00:00 6.0 18/09/2022 07:00 day
2 2022-09-18 08:00:00 6.5 18/09/2022 08:00 day
3 2022-09-18 09:00:00 6.7 18/09/2022 09:00 day
4 2022-09-18 14:00:00 7.9 18/09/2022 14:00 day
5 2022-09-18 15:00:00 7.8 18/09/2022 15:00 day
6 2022-09-18 16:00:00 7.6 18/09/2022 16:00 day
7 2022-09-18 17:00:00 6.8 18/09/2022 17:00 day
8 2022-09-18 18:00:00 6.4 18/09/2022 18:00 day
9 2022-09-18 19:00:00 5.7 18/09/2022 19:00 night
10 2022-09-18 20:00:00 4.8 18/09/2022 20:00 night
11 2022-09-18 21:00:00 5.4 18/09/2022 21:00 night
12 2022-09-18 22:00:00 4.7 18/09/2022 22:00 night
13 2022-09-18 23:00:00 4.3 18/09/2022 23:00 night
14 2022-09-19 00:00:00 4.1 19/09/2022 00:00 night
15 2022-09-19 01:00:00 4.4 19/09/2022 01:00 night
16 2022-09-19 04:00:00 3.5 19/09/2022 04:00 night
17 2022-09-19 05:00:00 2.8 19/09/2022 05:00 night
18 2022-09-19 06:00:00 3.8 19/09/2022 06:00 day
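A side note on the lambda above: the comparison works because the 'Time' values are zero-padded 'HH:MM' strings, which sort lexicographically in chronological order. A vectorized sketch of the same rule (my addition, not part of the original answer):
import numpy as np

# day if the time string falls in the inclusive range 06:00-18:00, night otherwise
df['day/night'] = np.where(df['Time'].between('06:00', '18:00'), 'day', 'night')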
You have two options.
Either you don't care about the order, in which case you can set ordered=False as a parameter of cut:
df['period'] = pd.cut(pd.to_datetime(df.DateTime).dt.hour,
                      bins=[0, 5, 17, 23],
                      labels=['night', 'morning', 'night'],
                      ordered=False,
                      include_lowest=True)
Or you do care to have night and morning ordered, in which case you can further convert to an ordered Categorical:
df['period'] = pd.Categorical(df['period'], categories=['night', 'morning'], ordered=True)
output:
DateTime Value Date Time period
0 2022-09-18 06:00:00 5.4 18/09/2022 06:00 morning
1 2022-09-18 07:00:00 6.0 18/09/2022 07:00 morning
2 2022-09-18 08:00:00 6.5 18/09/2022 08:00 morning
3 2022-09-18 09:00:00 6.7 18/09/2022 09:00 morning
8 2022-09-18 14:00:00 7.9 18/09/2022 14:00 morning
9 2022-09-18 15:00:00 7.8 18/09/2022 15:00 morning
10 2022-09-18 16:00:00 7.6 18/09/2022 16:00 morning
11 2022-09-18 17:00:00 6.8 18/09/2022 17:00 morning
12 2022-09-18 18:00:00 6.4 18/09/2022 18:00 night
13 2022-09-18 19:00:00 5.7 18/09/2022 19:00 night
14 2022-09-18 20:00:00 4.8 18/09/2022 20:00 night
15 2022-09-18 21:00:00 5.4 18/09/2022 21:00 night
16 2022-09-18 22:00:00 4.7 18/09/2022 22:00 night
17 2022-09-18 23:00:00 4.3 18/09/2022 23:00 night
18 2022-09-19 00:00:00 4.1 19/09/2022 00:00 night
19 2022-09-19 01:00:00 4.4 19/09/2022 01:00 night
22 2022-09-19 04:00:00 3.5 19/09/2022 04:00 night
23 2022-09-19 05:00:00 2.8 19/09/2022 05:00 night
24 2022-09-19 06:00:00 3.8 19/09/2022 06:00 morning
The resulting column:
df['period']
0 morning
1 morning
2 morning
...
23 night
24 morning
Name: period, dtype: category
Categories (2, object): ['morning', 'night']
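As a quick illustration of what the ordered conversion buys you (a sketch, assuming the Categorical conversion above has been applied):
# comparisons and sorting now respect the declared order 'night' < 'morning'
df['period'] > 'night'    # True for 'morning' rows, False for 'night' rows
df.sort_values('period')  # 'night' rows first, then 'morning'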
Dataset
df_one = ['2017-07-27 04:00:00', '2017-08-07 04:00:00', '2017-08-11 20:00:00', '2017-08-15 16:00:00', '2017-08-21 20:00:00', '2017-08-23 08:00:00', '2017-08-23 16:00:00', '2017-08-31 20:00:00', '2017-09-01 08:00:00', '2017-09-01 16:00:00', '2017-09-01 20:00:00', '2017-09-04 00:00:00', '2017-09-04 20:00:00', '2017-09-05 00:00:00', '2017-09-05 04:00:00', '2017-09-12 12:00:00', '2017-09-13 12:00:00', '2017-09-14 00:00:00', '2017-09-18 04:00:00', '2017-09-21 08:00:00', '2017-09-22 16:00:00', '2017-09-25 08:00:00', '2017-10-10 12:00:00', '2017-10-16 16:00:00', '2017-10-19 12:00:00', '2017-10-23 04:00:00', '2017-10-26 00:00:00', '2017-10-27 00:00:00', '2017-11-10 04:00:00', '2017-11-21 08:00:00', '2017-11-22 16:00:00', '2017-11-30 00:00:00', '2017-11-30 08:00:00', '2017-11-30 16:00:00', '2017-12-01 00:00:00', '2017-12-04 20:00:00', '2017-12-14 08:00:00', '2017-12-15 12:00:00', '2017-12-15 16:00:00', '2017-12-18 00:00:00', '2017-12-19 12:00:00', '2018-01-08 20:00:00', '2018-01-11 20:00:00', '2018-02-06 04:00:00', '2018-02-13 20:00:00', '2018-02-20 08:00:00', '2018-03-02 20:00:00', '2018-03-09 08:00:00', '2018-03-13 20:00:00', '2018-03-16 00:00:00', '2018-03-20 08:00:00', '2018-03-20 16:00:00', '2018-03-22 08:00:00', '2018-03-29 04:00:00', '2018-04-09 20:00:00', '2018-04-13 20:00:00', '2018-04-16 00:00:00', '2018-04-20 08:00:00', '2018-05-11 20:00:00', '2018-05-15 16:00:00', '2018-05-31 16:00:00', '2018-06-13 12:00:00', '2018-06-14 00:00:00', '2018-06-14 20:00:00', '2018-06-22 16:00:00', '2018-06-27 20:00:00', '2018-06-29 20:00:00', '2018-07-03 00:00:00', '2018-07-03 04:00:00', '2018-07-12 04:00:00', '2018-07-16 20:00:00', '2018-07-18 00:00:00', '2018-07-20 20:00:00', '2018-07-27 04:00:00', '2018-07-31 00:00:00', '2018-08-02 00:00:00', '2018-08-20 04:00:00', '2018-09-03 00:00:00', '2018-09-06 08:00:00', '2018-09-07 20:00:00', '2018-09-13 00:00:00', '2018-09-27 16:00:00', '2018-10-11 08:00:00', '2018-10-17 20:00:00', '2018-11-02 00:00:00', '2018-11-05 20:00:00', '2018-11-06 00:00:00', '2018-11-09 04:00:00', '2018-11-16 08:00:00', '2018-11-23 20:00:00', '2018-11-29 12:00:00', '2018-12-03 00:00:00', '2018-12-03 16:00:00', '2018-12-03 20:00:00', '2018-12-04 08:00:00', '2018-12-05 08:00:00', '2018-12-07 00:00:00', '2018-12-12 00:00:00', '2018-12-13 12:00:00', '2018-12-13 20:00:00', '2018-12-18 08:00:00', '2018-12-27 00:00:00', '2018-12-28 00:00:00', '2019-01-03 00:00:00', '2019-01-07 08:00:00', '2019-01-14 20:00:00', '2019-01-15 08:00:00', '2019-01-15 16:00:00', '2019-01-28 04:00:00', '2019-02-05 12:00:00', '2019-02-18 20:00:00', '2019-02-19 12:00:00', '2019-02-20 00:00:00', '2019-03-04 16:00:00', '2019-03-13 00:00:00', '2019-03-22 20:00:00', '2019-04-08 20:00:00', '2019-04-18 16:00:00', '2019-04-30 16:00:00', '2019-05-03 00:00:00', '2019-05-07 04:00:00', '2019-05-08 00:00:00', '2019-05-08 12:00:00', '2019-05-09 08:00:00', '2019-05-09 12:00:00', '2019-05-09 16:00:00', '2019-05-09 20:00:00', '2019-05-15 04:00:00', '2019-05-24 08:00:00', '2019-05-29 00:00:00', '2019-06-03 08:00:00', '2019-06-14 00:00:00', '2019-06-20 12:00:00', '2019-07-01 16:00:00', '2019-07-11 12:00:00', '2019-07-16 16:00:00', '2019-07-19 04:00:00', '2019-07-22 00:00:00', '2019-08-05 16:00:00', '2019-08-14 04:00:00', '2019-08-26 04:00:00', '2019-08-27 12:00:00', '2019-08-27 16:00:00', '2019-08-28 00:00:00', '2019-09-05 08:00:00', '2019-09-11 20:00:00', '2019-09-13 04:00:00', '2019-09-17 00:00:00', '2019-09-18 04:00:00', '2019-09-19 08:00:00', '2019-09-19 16:00:00', '2019-09-20 20:00:00', '2019-10-03 04:00:00', '2019-10-09 08:00:00', 
'2019-10-09 16:00:00', '2019-10-25 08:00:00', '2019-10-30 08:00:00', '2019-11-05 12:00:00', '2019-11-18 00:00:00', '2019-11-25 00:00:00', '2019-12-02 20:00:00', '2019-12-09 08:00:00', '2019-12-10 16:00:00', '2019-12-19 00:00:00', '2019-12-19 16:00:00', '2019-12-19 20:00:00', '2019-12-27 08:00:00', '2020-01-03 16:00:00', '2020-01-06 16:00:00', '2020-01-08 00:00:00', '2020-01-14 12:00:00', '2020-01-14 20:00:00', '2020-01-15 20:00:00', '2020-01-17 16:00:00', '2020-01-31 20:00:00', '2020-02-05 04:00:00', '2020-02-24 04:00:00', '2020-02-24 12:00:00', '2020-02-25 00:00:00', '2020-03-12 20:00:00', '2020-03-26 04:00:00', '2020-04-01 20:00:00', '2020-04-08 04:00:00', '2020-04-08 08:00:00', '2020-04-09 20:00:00', '2020-04-16 04:00:00', '2020-04-27 20:00:00', '2020-04-28 12:00:00', '2020-04-28 16:00:00', '2020-05-05 16:00:00', '2020-05-13 04:00:00', '2020-05-14 04:00:00', '2020-05-19 00:00:00', '2020-05-25 00:00:00', '2020-05-26 16:00:00', '2020-06-15 00:00:00', '2020-06-16 08:00:00', '2020-06-17 04:00:00', '2020-06-23 08:00:00', '2020-06-25 12:00:00', '2020-06-29 20:00:00', '2020-06-30 00:00:00', '2020-07-02 04:00:00', '2020-07-03 12:00:00', '2020-07-06 08:00:00', '2020-07-10 16:00:00', '2020-07-10 20:00:00', '2020-08-10 04:00:00', '2020-08-13 08:00:00', '2020-08-20 16:00:00', '2020-08-21 08:00:00', '2020-08-21 16:00:00', '2020-08-27 12:00:00', '2020-08-28 00:00:00', '2020-08-28 12:00:00', '2020-09-02 20:00:00', '2020-09-10 20:00:00', '2020-09-17 04:00:00', '2020-09-18 12:00:00', '2020-09-21 16:00:00', '2020-09-30 00:00:00', '2020-10-14 00:00:00', '2020-10-20 00:00:00', '2020-10-28 04:00:00', '2020-11-05 04:00:00', '2020-11-11 20:00:00', '2020-11-13 00:00:00', '2020-11-24 08:00:00', '2020-11-24 16:00:00', '2020-12-10 08:00:00', '2020-12-10 16:00:00', '2020-12-23 04:00:00', '2020-12-24 12:00:00', '2020-12-24 16:00:00', '2020-12-28 12:00:00', '2021-01-08 08:00:00', '2021-01-21 20:00:00', '2021-01-26 12:00:00', '2021-01-27 00:00:00', '2021-01-27 16:00:00', '2021-02-09 04:00:00', '2021-02-17 08:00:00', '2021-02-19 16:00:00', '2021-02-26 20:00:00', '2021-03-11 20:00:00', '2021-03-12 20:00:00', '2021-03-15 04:00:00', '2021-03-15 12:00:00', '2021-03-18 04:00:00', '2021-03-19 04:00:00', '2021-03-23 04:00:00', '2021-03-23 16:00:00', '2021-04-02 16:00:00', '2021-04-05 00:00:00', '2021-04-06 00:00:00', '2021-05-03 00:00:00', '2021-05-07 04:00:00', '2021-05-13 04:00:00', '2021-05-14 20:00:00', '2021-05-27 08:00:00', '2021-06-01 00:00:00', '2021-06-02 16:00:00', '2021-06-03 08:00:00', '2021-06-03 12:00:00']
df_two = ['2017-08-11 23:59', '2017-09-14 23:59', '2017-10-10 23:59', '2017-10-12 23:59', '2017-10-16 23:59', '2017-10-25 23:59', '2018-04-23 23:59', '2018-07-09 23:59', '2018-07-31 23:59', '2018-08-30 23:59', '2018-09-05 23:59', '2018-09-28 23:59', '2018-11-20 23:59', '2019-01-03 23:59', '2019-01-16 23:59', '2019-01-29 23:59', '2019-02-06 23:59', '2019-04-18 23:59', '2019-05-10 23:59', '2019-06-04 23:59', '2019-06-05 23:59', '2019-07-03 23:59', '2019-07-10 23:59', '2019-07-16 23:59', '2019-08-05 23:59', '2019-10-15 23:59', '2019-10-29 23:59', '2019-12-10 23:59', '2019-12-26 23:59', '2020-01-08 23:59', '2020-01-14 23:59', '2020-01-20 23:59', '2020-02-03 23:59', '2020-03-30 23:59', '2020-05-01 23:59', '2020-05-19 23:59', '2020-10-02 23:59', '2020-10-05 23:59', '2020-10-14 23:59', '2020-11-11 23:59', '2021-01-19 23:59', '2021-01-20 23:59', '2021-02-02 23:59', '2021-02-12 23:59', '2021-02-19 23:59', '2021-02-22 23:59', '2021-03-02 23:59', '2021-04-14 23:59', '2021-04-16 23:59', '2021-05-05 23:59', '2021-05-06 23:59']
I'm looking to find the previous and current rows from df_one where a df_two date falls in between two consecutive rows of df_one.
The sort of logic I'm looking to write is something along the lines below:
for each row in df_two:
    for each row in df_one:
        if df_two > df_one_previous_row and df_two < df_one_current_row:
            print(df_one_previous_row, df_one_current_row)
Expected Output
2017-08-11 20:00:00 - 2017-08-11 23:59 - 2017-08-15 16:00:00
Found
2017-09-14 00:00:00 - 2017-09-14 23:59 - 2017-09-18 04:00:00
Found
2017-10-10 12:00:00 - 2017-10-10 23:59 - 2017-10-16 16:00:00
Found
2017-10-10 12:00:00 - 2017-10-12 23:59 - 2017-10-16 16:00:00
Found
2017-10-16 16:00:00 - 2017-10-16 23:59 - 2017-10-19 12:00:00
Found
2017-10-23 04:00:00 - 2017-10-25 23:59 - 2017-10-26 00:00:00
Found
2018-04-20 08:00:00 - 2018-04-23 23:59 - 2018-05-11 20:00:00
Found
2018-07-03 04:00:00 - 2018-07-09 23:59 - 2018-07-12 04:00:00
Found
2018-07-31 00:00:00 - 2018-07-31 23:59 - 2018-08-02 00:00:00
Found
2018-08-20 04:00:00 - 2018-08-30 23:59 - 2018-09-03 00:00:00
Found
2018-09-03 00:00:00 - 2018-09-05 23:59 - 2018-09-06 08:00:00
Found
2018-09-27 16:00:00 - 2018-09-28 23:59 - 2018-10-11 08:00:00
Found
2018-11-16 08:00:00 - 2018-11-20 23:59 - 2018-11-23 20:00:00
Found
2019-01-03 00:00:00 - 2019-01-03 23:59 - 2019-01-07 08:00:00
Found
2019-01-15 16:00:00 - 2019-01-16 23:59 - 2019-01-28 04:00:00
Found
2019-01-28 04:00:00 - 2019-01-29 23:59 - 2019-02-05 12:00:00
Found
2019-02-05 12:00:00 - 2019-02-06 23:59 - 2019-02-18 20:00:00
Found
2019-04-18 16:00:00 - 2019-04-18 23:59 - 2019-04-30 16:00:00
Found
2019-05-09 20:00:00 - 2019-05-10 23:59 - 2019-05-15 04:00:00
Found
2019-06-03 08:00:00 - 2019-06-04 23:59 - 2019-06-14 00:00:00
Found
2019-06-03 08:00:00 - 2019-06-05 23:59 - 2019-06-14 00:00:00
Found
2019-07-01 16:00:00 - 2019-07-03 23:59 - 2019-07-11 12:00:00
Found
2019-07-01 16:00:00 - 2019-07-10 23:59 - 2019-07-11 12:00:00
Found
2019-07-16 16:00:00 - 2019-07-16 23:59 - 2019-07-19 04:00:00
Found
2019-08-05 16:00:00 - 2019-08-05 23:59 - 2019-08-14 04:00:00
Found
2019-10-09 16:00:00 - 2019-10-15 23:59 - 2019-10-25 08:00:00
Found
2019-10-25 08:00:00 - 2019-10-29 23:59 - 2019-10-30 08:00:00
Found
2019-12-10 16:00:00 - 2019-12-10 23:59 - 2019-12-19 00:00:00
Found
2019-12-19 20:00:00 - 2019-12-26 23:59 - 2019-12-27 08:00:00
Found
2020-01-08 00:00:00 - 2020-01-08 23:59 - 2020-01-14 12:00:00
Found
2020-01-14 20:00:00 - 2020-01-14 23:59 - 2020-01-15 20:00:00
Found
2020-01-17 16:00:00 - 2020-01-20 23:59 - 2020-01-31 20:00:00
Found
2020-01-31 20:00:00 - 2020-02-03 23:59 - 2020-02-05 04:00:00
Found
2020-03-26 04:00:00 - 2020-03-30 23:59 - 2020-04-01 20:00:00
Found
2020-04-28 16:00:00 - 2020-05-01 23:59 - 2020-05-05 16:00:00
Found
2020-05-19 00:00:00 - 2020-05-19 23:59 - 2020-05-25 00:00:00
Found
2020-09-30 00:00:00 - 2020-10-02 23:59 - 2020-10-14 00:00:00
Found
2020-09-30 00:00:00 - 2020-10-05 23:59 - 2020-10-14 00:00:00
Found
2020-10-14 00:00:00 - 2020-10-14 23:59 - 2020-10-20 00:00:00
Found
2020-11-11 20:00:00 - 2020-11-11 23:59 - 2020-11-13 00:00:00
Found
2021-01-08 08:00:00 - 2021-01-19 23:59 - 2021-01-21 20:00:00
Found
2021-01-08 08:00:00 - 2021-01-20 23:59 - 2021-01-21 20:00:00
Found
2021-01-27 16:00:00 - 2021-02-02 23:59 - 2021-02-09 04:00:00
Found
2021-02-09 04:00:00 - 2021-02-12 23:59 - 2021-02-17 08:00:00
Found
2021-02-19 16:00:00 - 2021-02-19 23:59 - 2021-02-26 20:00:00
Found
2021-02-19 16:00:00 - 2021-02-22 23:59 - 2021-02-26 20:00:00
Found
2021-02-26 20:00:00 - 2021-03-02 23:59 - 2021-03-11 20:00:00
Found
2021-04-06 00:00:00 - 2021-04-14 23:59 - 2021-05-03 00:00:00
Found
2021-04-06 00:00:00 - 2021-04-16 23:59 - 2021-05-03 00:00:00
Found
2021-05-03 00:00:00 - 2021-05-05 23:59 - 2021-05-07 04:00:00
Found
2021-05-03 00:00:00 - 2021-05-06 23:59 - 2021-05-07 04:00:00
Found
Looping with a for or while does not look very efficient. Could I please get help writing a piece of code for this?
We can use np.searchsorted to find the indices in df_one for the corresponding timestamps in df_two which satisfy the condition of inclusion. Note: the timestamps in df_one must be sorted for searchsorted to work properly.
one = pd.to_datetime(df_one)
two = pd.to_datetime(df_two)

# For each timestamp in `two`, find its insertion index into the sorted `one`.
i = np.searchsorted(one, two)

# Index 0 or len(one) means the timestamp falls before the first or after the
# last entry of `one`, so there is no bracketing pair on both sides.
m = ~np.isin(i, [0, len(one)])

df = pd.DataFrame({'df_two': two})
df.loc[m, 'df_one_prev'] = one[i[m] - 1]   # df_one row just before
df.loc[m, 'df_one_curr'] = one[i[m]]       # df_one row just after
df['found'] = np.where(m, 'found', 'not found')
df_two df_one_prev df_one_curr found
0 2017-08-11 23:59:00 2017-08-11 20:00:00 2017-08-15 16:00:00 found
1 2017-09-14 23:59:00 2017-09-14 00:00:00 2017-09-18 04:00:00 found
2 2017-10-10 23:59:00 2017-10-10 12:00:00 2017-10-16 16:00:00 found
3 2017-10-12 23:59:00 2017-10-10 12:00:00 2017-10-16 16:00:00 found
4 2017-10-16 23:59:00 2017-10-16 16:00:00 2017-10-19 12:00:00 found
5 2017-10-25 23:59:00 2017-10-23 04:00:00 2017-10-26 00:00:00 found
6 2018-04-23 23:59:00 2018-04-20 08:00:00 2018-05-11 20:00:00 found
7 2018-07-09 23:59:00 2018-07-03 04:00:00 2018-07-12 04:00:00 found
8 2018-07-31 23:59:00 2018-07-31 00:00:00 2018-08-02 00:00:00 found
9 2018-08-30 23:59:00 2018-08-20 04:00:00 2018-09-03 00:00:00 found
10 2018-09-05 23:59:00 2018-09-03 00:00:00 2018-09-06 08:00:00 found
11 2018-09-28 23:59:00 2018-09-27 16:00:00 2018-10-11 08:00:00 found
12 2018-11-20 23:59:00 2018-11-16 08:00:00 2018-11-23 20:00:00 found
13 2019-01-03 23:59:00 2019-01-03 00:00:00 2019-01-07 08:00:00 found
14 2019-01-16 23:59:00 2019-01-15 16:00:00 2019-01-28 04:00:00 found
15 2019-01-29 23:59:00 2019-01-28 04:00:00 2019-02-05 12:00:00 found
16 2019-02-06 23:59:00 2019-02-05 12:00:00 2019-02-18 20:00:00 found
17 2019-04-18 23:59:00 2019-04-18 16:00:00 2019-04-30 16:00:00 found
18 2019-05-10 23:59:00 2019-05-09 20:00:00 2019-05-15 04:00:00 found
19 2019-06-04 23:59:00 2019-06-03 08:00:00 2019-06-14 00:00:00 found
20 2019-06-05 23:59:00 2019-06-03 08:00:00 2019-06-14 00:00:00 found
21 2019-07-03 23:59:00 2019-07-01 16:00:00 2019-07-11 12:00:00 found
22 2019-07-10 23:59:00 2019-07-01 16:00:00 2019-07-11 12:00:00 found
23 2019-07-16 23:59:00 2019-07-16 16:00:00 2019-07-19 04:00:00 found
24 2019-08-05 23:59:00 2019-08-05 16:00:00 2019-08-14 04:00:00 found
25 2019-10-15 23:59:00 2019-10-09 16:00:00 2019-10-25 08:00:00 found
26 2019-10-29 23:59:00 2019-10-25 08:00:00 2019-10-30 08:00:00 found
27 2019-12-10 23:59:00 2019-12-10 16:00:00 2019-12-19 00:00:00 found
28 2019-12-26 23:59:00 2019-12-19 20:00:00 2019-12-27 08:00:00 found
29 2020-01-08 23:59:00 2020-01-08 00:00:00 2020-01-14 12:00:00 found
30 2020-01-14 23:59:00 2020-01-14 20:00:00 2020-01-15 20:00:00 found
31 2020-01-20 23:59:00 2020-01-17 16:00:00 2020-01-31 20:00:00 found
32 2020-02-03 23:59:00 2020-01-31 20:00:00 2020-02-05 04:00:00 found
33 2020-03-30 23:59:00 2020-03-26 04:00:00 2020-04-01 20:00:00 found
34 2020-05-01 23:59:00 2020-04-28 16:00:00 2020-05-05 16:00:00 found
35 2020-05-19 23:59:00 2020-05-19 00:00:00 2020-05-25 00:00:00 found
36 2020-10-02 23:59:00 2020-09-30 00:00:00 2020-10-14 00:00:00 found
37 2020-10-05 23:59:00 2020-09-30 00:00:00 2020-10-14 00:00:00 found
38 2020-10-14 23:59:00 2020-10-14 00:00:00 2020-10-20 00:00:00 found
39 2020-11-11 23:59:00 2020-11-11 20:00:00 2020-11-13 00:00:00 found
40 2021-01-19 23:59:00 2021-01-08 08:00:00 2021-01-21 20:00:00 found
41 2021-01-20 23:59:00 2021-01-08 08:00:00 2021-01-21 20:00:00 found
42 2021-02-02 23:59:00 2021-01-27 16:00:00 2021-02-09 04:00:00 found
43 2021-02-12 23:59:00 2021-02-09 04:00:00 2021-02-17 08:00:00 found
44 2021-02-19 23:59:00 2021-02-19 16:00:00 2021-02-26 20:00:00 found
45 2021-02-22 23:59:00 2021-02-19 16:00:00 2021-02-26 20:00:00 found
46 2021-03-02 23:59:00 2021-02-26 20:00:00 2021-03-11 20:00:00 found
47 2021-04-14 23:59:00 2021-04-06 00:00:00 2021-05-03 00:00:00 found
48 2021-04-16 23:59:00 2021-04-06 00:00:00 2021-05-03 00:00:00 found
49 2021-05-05 23:59:00 2021-05-03 00:00:00 2021-05-07 04:00:00 found
50 2021-05-06 23:59:00 2021-05-03 00:00:00 2021-05-07 04:00:00 found
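For reference, pd.merge_asof can perform the same bracketing lookups. Here is a sketch (my own addition, with made-up column names ts and ts_one); note that merge_asof requires both keys to be sorted, and out-of-range rows come back as NaT rather than being masked out:
one_df = pd.DataFrame({'ts_one': pd.to_datetime(df_one)}).sort_values('ts_one')
two_df = pd.DataFrame({'ts': pd.to_datetime(df_two)}).sort_values('ts')

# nearest df_one timestamp at or before each df_two timestamp
prev = pd.merge_asof(two_df, one_df, left_on='ts', right_on='ts_one', direction='backward')
# nearest df_one timestamp at or after each df_two timestamp
nxt = pd.merge_asof(two_df, one_df, left_on='ts', right_on='ts_one', direction='forward')

two_df['df_one_prev'] = prev['ts_one'].to_numpy()
two_df['df_one_curr'] = nxt['ts_one'].to_numpy()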
I have a dataframe containing trades with duplicated timestamps, with buy and sell orders divided over several rows. In my example the total order amount is the sum over the same timestamp for that particular stock. I have created a simplified dataframe to show what the data looks like.
I would like to end up with a dataframe with results from the trades and a trade ID for each trade.
All trades are long positions, i.e. buy and try to sell at a higher price.
The ID column for the desired output df2 is answered in this thread: Create ID column in a pandas dataframe
import pandas as pd
from datetime import datetime
import numpy as np
string_date =['2018-01-01 01:00:00',
'2018-01-01 01:00:00',
'2018-01-01 01:00:00',
'2018-01-01 01:00:00',
'2018-01-01 02:00:00',
'2018-01-01 03:00:00',
'2018-01-01 03:00:00',
'2018-01-01 03:00:00',
'2018-01-01 04:00:00',
'2018-01-01 04:00:00',
'2018-01-01 04:00:00',
'2018-01-01 07:00:00',
'2018-01-01 07:00:00',
'2018-01-01 07:00:00',
'2018-01-01 08:00:00',
'2018-01-01 08:00:00',
'2018-01-01 08:00:00',
'2018-02-01 12:00:00',
]
data ={'stock': ['A','A','A','A','B','A','A','A','C','C','C','B','B','B','C','C','C','B'],
'deal': ['buy', 'buy', 'buy','buy','buy','sell','sell','sell','buy','buy','buy','sell','sell','sell','sell','sell','sell','buy'],
'amount':[1,2,3,4,10,8,1,1,3,2,5,2,2,6,3,3,4,5],
'price':[10,10,10,10,2,20,20,20,3,3,3,1,1,1,2,2,2,11]}
df = pd.DataFrame(data, index=string_date)
df
Out[245]:
stock deal amount price
2018-01-01 01:00:00 A buy 1 10
2018-01-01 01:00:00 A buy 2 10
2018-01-01 01:00:00 A buy 3 10
2018-01-01 01:00:00 A buy 4 10
2018-01-01 02:00:00 B buy 10 2
2018-01-01 03:00:00 A sell 8 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 04:00:00 C buy 3 3
2018-01-01 04:00:00 C buy 2 3
2018-01-01 04:00:00 C buy 5 3
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 6 1
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 4 2
2018-02-01 12:00:00 B buy 5 11
One desired output:
string_date2 =['2018-01-01 01:00:00',
'2018-01-01 02:00:00',
'2018-01-01 03:00:00',
'2018-01-01 04:00:00',
'2018-01-01 07:00:00',
'2018-01-01 08:00:00',
'2018-02-01 12:00:00',
]
data2 ={'stock': ['A','B', 'A', 'C', 'B','C','B'],
'deal': ['buy', 'buy','sell','buy','sell','sell','buy'],
'amount':[10,10,10,10,10,10,5],
'price':[10,2,20,3,1,2,11],
'ID': ['1', '2','1','3','2','3','4']
}
df2 = pd.DataFrame(data2, index=string_date2)
df2
Out[226]:
stock deal amount price ID
2018-01-01 01:00:00 A buy 10 10 1
2018-01-01 02:00:00 B buy 10 2 2
2018-01-01 03:00:00 A sell 10 20 1
2018-01-01 04:00:00 C buy 10 3 3
2018-01-01 07:00:00 B sell 10 1 2
2018-01-01 08:00:00 C sell 10 2 3
2018-02-01 12:00:00 B buy 5 11 4
Any ideas?
This solution assumes a 'Long Only' portfolio where short sales are not allowed. Once a position is opened for a given stock, the transaction is assigned a new trade ID. Increasing the position in that stock keeps the same trade ID, as does any sell transaction reducing the size of the position (including the final sale where the position quantity is reduced to zero). A subsequent buy transaction in that same stock results in a new trade ID.
In order to maintain consistent trade identifiers with a growing log of transactions, I created a class TradeTracker to track and assign trade identifiers for each transaction.
import numpy as np
import pandas as pd
# Create sample dataframe.
dates = [
'2018-01-01 01:00:00',
'2018-01-01 01:01:00',
'2018-01-01 01:02:00',
'2018-01-01 01:03:00',
'2018-01-01 02:00:00',
'2018-01-01 03:00:00',
'2018-01-01 03:01:00',
'2018-01-01 03:03:00',
'2018-01-01 04:00:00',
'2018-01-01 04:01:00',
'2018-01-01 04:02:00',
'2018-01-01 07:00:00',
'2018-01-01 07:01:00',
'2018-01-01 07:02:00',
'2018-01-01 08:00:00',
'2018-01-01 08:01:00',
'2018-01-01 08:02:00',
'2018-02-01 12:00:00',
'2018-03-01 12:00:00',
]
data = {
'stock': ['A','A','A','A','B','A','A','A','C','C','C','B','B','B','C','C','C','B','A'],
'deal': ['buy', 'buy', 'buy', 'buy', 'buy', 'sell', 'sell', 'sell', 'buy', 'buy', 'buy',
'sell', 'sell', 'sell', 'sell', 'sell', 'sell', 'buy', 'buy'],
'amount': [1, 2, 3, 4, 10, 8, 1, 1, 3, 2, 5, 2, 2, 6, 3, 3, 4, 5, 10],
'price': [10, 10, 10, 10, 2, 20, 20, 20, 3, 3, 3, 1, 1, 1, 2, 2, 2, 11, 15]
}
df = pd.DataFrame(data, index=pd.to_datetime(dates))
>>> df
stock deal amount price
2018-01-01 01:00:00 A buy 1 10
2018-01-01 01:01:00 A buy 2 10
2018-01-01 01:02:00 A buy 3 10
2018-01-01 01:03:00 A buy 4 10
2018-01-01 02:00:00 B buy 10 2
2018-01-01 03:00:00 A sell 8 20
2018-01-01 03:01:00 A sell 1 20
2018-01-01 03:03:00 A sell 1 20
2018-01-01 04:00:00 C buy 3 3
2018-01-01 04:01:00 C buy 2 3
2018-01-01 04:02:00 C buy 5 3
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:01:00 B sell 2 1
2018-01-01 07:02:00 B sell 6 1
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:01:00 C sell 3 2
2018-01-01 08:02:00 C sell 4 2
2018-02-01 12:00:00 B buy 5 11
2018-03-01 12:00:00 A buy 10 15
# Add `position` column representing the cumulative buys and sells for a given stock.
df['position'] = (
    df
    .assign(temp_amount=np.where(df['deal'].eq('buy'), df['amount'], -df['amount']))
    .groupby(['stock'])['temp_amount']
    .cumsum()
)
# Create a class to track trade identifiers and instantiate it.
class TradeTracker():
    def __init__(self):
        self.trade_counter = 0
        self.trade_ids = {}

    def get_trade_id(self, stock, position):
        if position == 0:
            trade_id = self.trade_ids.pop(stock)
        elif stock not in self.trade_ids:
            self.trade_counter += 1
            self.trade_ids[stock] = trade_id = self.trade_counter
        else:
            trade_id = self.trade_ids[stock]
        return trade_id
trade_tracker = TradeTracker()
# Add a `trade_id` column using our custom class in a list comprehension.
df['trade_id'] = [trade_tracker.get_trade_id(stock, position)
                  for stock, position in df[['stock', 'position']].to_numpy()]
>>> df
stock deal amount price position trade_id
2018-01-01 01:00:00 A buy 1 10 1 1
2018-01-01 01:01:00 A buy 2 10 3 1
2018-01-01 01:02:00 A buy 3 10 6 1
2018-01-01 01:03:00 A buy 4 10 10 1
2018-01-01 02:00:00 B buy 10 2 10 2
2018-01-01 03:00:00 A sell 8 20 2 1
2018-01-01 03:01:00 A sell 1 20 1 1
2018-01-01 03:03:00 A sell 1 20 0 1
2018-01-01 04:00:00 C buy 3 3 3 3
2018-01-01 04:01:00 C buy 2 3 5 3
2018-01-01 04:02:00 C buy 5 3 10 3
2018-01-01 07:00:00 B sell 2 1 8 2
2018-01-01 07:01:00 B sell 2 1 6 2
2018-01-01 07:02:00 B sell 6 1 0 2
2018-01-01 08:00:00 C sell 3 2 7 3
2018-01-01 08:01:00 C sell 3 2 4 3
2018-01-01 08:02:00 C sell 4 2 0 3
2018-02-01 12:00:00 B buy 5 11 5 4
2018-03-01 12:00:00 A buy 10 15 10 5
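With trade_id in place, per-trade results can be aggregated. A sketch of the per-trade cash flow (my addition, not part of the answer above; note that the last two trades are still open, so their "result" is just the open cost):
# signed cash flow: sells bring cash in, buys take cash out
df['cash'] = np.where(df['deal'].eq('sell'), 1, -1) * df['amount'] * df['price']
result = df.groupby('trade_id')['cash'].sum()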
Changed your string_date to this:
In [2295]: string_date =['2018-01-01 01:00:00',
...: '2018-01-01 01:00:00',
...: '2018-01-01 01:00:00',
...: '2018-01-01 01:00:00',
...: '2018-01-01 02:00:00',
...: '2018-01-01 03:00:00',
...: '2018-01-01 03:00:00',
...: '2018-01-01 03:00:00',
...: '2018-01-01 04:00:00',
...: '2018-01-01 04:00:00',
...: '2018-01-01 04:00:00',
...: '2018-01-01 07:00:00',
...: '2018-01-01 07:00:00',
...: '2018-01-01 07:00:00',
...: '2018-01-01 08:00:00',
...: '2018-01-01 08:00:00',
...: '2018-01-01 08:00:00',
...: '2018-02-01 12:00:00',
...: ]
...:
So df now is:
In [2297]: df
Out[2297]:
stock deal amount price
2018-01-01 01:00:00 A buy 1 10
2018-01-01 01:00:00 A buy 2 10
2018-01-01 01:00:00 A buy 3 10
2018-01-01 01:00:00 A buy 4 10
2018-01-01 02:00:00 B buy 10 2
2018-01-01 03:00:00 A sell 8 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 04:00:00 C buy 3 3
2018-01-01 04:00:00 C buy 2 3
2018-01-01 04:00:00 C buy 5 3
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 6 1
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 4 2
2018-02-01 12:00:00 B buy 5 11
You can use GroupBy.agg to collapse the duplicated timestamps, then number the buys with a cumulative sum and forward-fill that ID onto the matching sells within each stock:
In [2302]: x = df.reset_index().groupby(['index', 'stock', 'deal'], as_index=False).agg({'amount': 'sum', 'price': 'max'}).set_index('index')
In [2303]: m = x['deal'] == 'buy'
In [2305]: x['ID'] = m.cumsum().where(m)
In [2307]: x['ID'] = x.groupby('stock')['ID'].ffill()
In [2308]: x
Out[2308]:
stock deal amount price ID
index
2018-01-01 01:00:00 A buy 10 10 1.0
2018-01-01 02:00:00 B buy 10 2 2.0
2018-01-01 03:00:00 A sell 10 20 1.0
2018-01-01 04:00:00 C buy 10 3 3.0
2018-01-01 07:00:00 B sell 10 1 2.0
2018-01-01 08:00:00 C sell 10 2 3.0
2018-02-01 12:00:00 B buy 5 11 4.0
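To see why the ID construction works, trace the intermediate steps across the seven aggregated rows (A buy, B buy, A sell, C buy, B sell, C sell, B buy):
# m (deal == 'buy')        : True  True  False  True  False  False  True
# m.cumsum()               : 1     2     2      3     3      3      4
# m.cumsum().where(m)      : 1     2     NaN    3     NaN    NaN    4
# groupby('stock').ffill() : 1     2     1      3     2      3      4
# Each sell inherits the ID of the most recent buy in the same stock.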
I have pulled some data from the internet which is basically 2 columns of hourly data for a whole year:
france.GetData(base_scenario, utils.enumerate_periods(start,end,'H','CET'))
output
2015-12-31 23:00:00+00:00 23.86
2016-01-01 00:00:00+00:00 22.39
2016-01-01 01:00:00+00:00 20.59
2016-01-01 02:00:00+00:00 16.81
2016-01-01 03:00:00+00:00 17.41
2016-01-01 04:00:00+00:00 17.02
2016-01-01 05:00:00+00:00 15.86...
I want to add two more columns: a 'peak' and an 'off peak' scaler column. If the time of day is between 0800 and 1800 there should be a 1 in the peak column, otherwise a 1 in the off peak column.
Could anyone please explain how to do this.
Many thanks
I think you can use to_datetime if the index is not already a DatetimeIndex, then use between_time to build the peak column and test it with notnull - NaN becomes False and a value becomes True. The boolean values are then converted to int (False -> 0 and True -> 1) by astype, and finally peak-off is derived from the peak column (thanks Quickbeam2k1):
df = pd.DataFrame({'col': {'2016-01-01 01:00:00+00:00': 20.59, '2016-01-01 07:00:00+00:00': 15.86, '2016-01-01 10:00:00+00:00': 15.86, '2016-01-01 09:00:00+00:00': 15.86, '2016-01-01 02:00:00+00:00': 16.81, '2016-01-01 03:00:00+00:00': 17.41, '2016-01-01 05:00:00+00:00': 15.86, '2016-01-01 04:00:00+00:00': 17.02, '2016-01-01 08:00:00+00:00': 15.86, '2015-12-31 23:00:00+00:00': 23.86, '2016-01-01 18:00:00+00:00': 15.86, '2016-01-01 06:00:00+00:00': 15.86, '2016-01-01 00:00:00+00:00': 22.39}})
print (df)
col
2015-12-31 23:00:00+00:00 23.86
2016-01-01 00:00:00+00:00 22.39
2016-01-01 01:00:00+00:00 20.59
2016-01-01 02:00:00+00:00 16.81
2016-01-01 03:00:00+00:00 17.41
2016-01-01 04:00:00+00:00 17.02
2016-01-01 05:00:00+00:00 15.86
2016-01-01 06:00:00+00:00 15.86
2016-01-01 07:00:00+00:00 15.86
2016-01-01 08:00:00+00:00 15.86
2016-01-01 09:00:00+00:00 15.86
2016-01-01 10:00:00+00:00 15.86
2016-01-01 18:00:00+00:00 15.86
print (df.index)
Index(['2015-12-31 23:00:00+00:00', '2016-01-01 00:00:00+00:00',
'2016-01-01 01:00:00+00:00', '2016-01-01 02:00:00+00:00',
'2016-01-01 03:00:00+00:00', '2016-01-01 04:00:00+00:00',
'2016-01-01 05:00:00+00:00', '2016-01-01 06:00:00+00:00',
'2016-01-01 07:00:00+00:00', '2016-01-01 08:00:00+00:00',
'2016-01-01 09:00:00+00:00', '2016-01-01 10:00:00+00:00',
'2016-01-01 18:00:00+00:00'],
dtype='object')
df.index = pd.to_datetime(df.index)
print (df.index)
DatetimeIndex(['2015-12-31 23:00:00', '2016-01-01 00:00:00',
'2016-01-01 01:00:00', '2016-01-01 02:00:00',
'2016-01-01 03:00:00', '2016-01-01 04:00:00',
'2016-01-01 05:00:00', '2016-01-01 06:00:00',
'2016-01-01 07:00:00', '2016-01-01 08:00:00',
'2016-01-01 09:00:00', '2016-01-01 10:00:00',
'2016-01-01 18:00:00'],
dtype='datetime64[ns]', freq=None)
df['peak'] = df['col'].between_time('08:00', '18:00')
df['peak'] = df['peak'].notnull().astype(int)
df['peak-off'] = 1 - df['peak']
print (df)
col peak peak-off
2015-12-31 23:00:00 23.86 0 1
2016-01-01 00:00:00 22.39 0 1
2016-01-01 01:00:00 20.59 0 1
2016-01-01 02:00:00 16.81 0 1
2016-01-01 03:00:00 17.41 0 1
2016-01-01 04:00:00 17.02 0 1
2016-01-01 05:00:00 15.86 0 1
2016-01-01 06:00:00 15.86 0 1
2016-01-01 07:00:00 15.86 0 1
2016-01-01 08:00:00 15.86 1 0
2016-01-01 09:00:00 15.86 1 0
2016-01-01 10:00:00 15.86 1 0
2016-01-01 18:00:00 15.86 1 0
Another solution is to first build a boolean mask from the conditions and then convert it to int; to invert the mask, use ~:
from datetime import datetime

h1 = datetime.strptime('08:00:00', '%H:%M:%S').time()
h2 = datetime.strptime('18:00:00', '%H:%M:%S').time()
times = df.index.time
mask = (times >= h1) & (times <= h2)
df['peak'] = mask.astype(int)
df['peak-off'] = (~mask).astype(int)
print (df)
col peak peak-off
2015-12-31 23:00:00 23.86 0 1
2016-01-01 00:00:00 22.39 0 1
2016-01-01 01:00:00 20.59 0 1
2016-01-01 02:00:00 16.81 0 1
2016-01-01 03:00:00 17.41 0 1
2016-01-01 04:00:00 17.02 0 1
2016-01-01 05:00:00 15.86 0 1
2016-01-01 06:00:00 15.86 0 1
2016-01-01 07:00:00 15.86 0 1
2016-01-01 08:00:00 15.86 1 0
2016-01-01 09:00:00 15.86 1 0
2016-01-01 10:00:00 15.86 1 0
2016-01-01 18:00:00 15.86 1 0
If the data only ever falls on whole hours, the solution can be simpler - use DatetimeIndex.hour for the mask:
df.index = pd.to_datetime(df.index)
print (df.index)
h = df.index.hour
mask = (h >= 8) & (h <= 18)
df['peak'] = mask.astype(int)
df['peak-off'] = (~mask).astype(int)
print (df)
col peak peak-off
2015-12-31 23:00:00 23.86 0 1
2016-01-01 00:00:00 22.39 0 1
2016-01-01 01:00:00 20.59 0 1
2016-01-01 02:00:00 16.81 0 1
2016-01-01 03:00:00 17.41 0 1
2016-01-01 04:00:00 17.02 0 1
2016-01-01 05:00:00 15.86 0 1
2016-01-01 06:00:00 15.86 0 1
2016-01-01 07:00:00 15.86 0 1
2016-01-01 08:00:00 15.86 1 0
2016-01-01 09:00:00 15.86 1 0
2016-01-01 10:00:00 15.86 1 0
2016-01-01 18:00:00 15.86 1 0
My DataFrame is
time NTCS001G002 NTCS001W005
0 2013-05-30 23:00:00 NaN NaN
1 2013-06-30 23:00:00 249 60
2 2013-07-31 23:00:00 161 2
3 2013-09-01 23:00:00 151 11
4 2013-09-04 23:00:00 14 0
5 2013-10-01 23:00:00 162 64
6 2013-11-01 00:00:00 281 175
7 2013-12-03 00:00:00 482 168
8 2014-01-02 00:00:00 378 NaN
9 2014-01-03 00:00:00 NaN NaN
10 2014-02-03 00:00:00 NaN 167
11 2014-03-03 00:00:00 502 167
When I iterate the rows like
for index, row in diffs.iterrows():
    print "err", row.tolist()
err [Timestamp('2013-05-30 23:00:00', tz=None), NaT, NaT]
err [Timestamp('2013-06-30 23:00:00', tz=None), 249.0, 60.0]
err [Timestamp('2013-07-31 23:00:00', tz=None), 161.0, 2.0]
err [Timestamp('2013-09-01 23:00:00', tz=None), 151.0, 11.0]
err [Timestamp('2013-09-04 23:00:00', tz=None), 14.0, 0.0]
err [Timestamp('2013-10-01 23:00:00', tz=None), 162.0, 64.0]
err [Timestamp('2013-11-01 00:00:00', tz=None), 281.0, 175.0]
err [Timestamp('2013-12-03 00:00:00', tz=None), 482.0, 168.0]
err [Timestamp('2014-01-02 00:00:00', tz=None), 378.0, nan]
err [Timestamp('2014-01-03 00:00:00', tz=None), NaT, NaT]
err [Timestamp('2014-02-03 00:00:00', tz=None), nan, 167.0]
err [Timestamp('2014-03-03 00:00:00', tz=None), 502.0, 167.0]
I am not sure if those NaT are a bug or not; I think they should be NaN.
Can pandas be made not to return NaT, and if not, how could I check against them, as I will have to replace them in the list?
Thanks
The reason is that iterrows makes each row into a Series, and a row holding only a Timestamp and missing values is cast to datetime64:
In [11]: pd.Series([pd.Timestamp('2014-01-03 00:00:00', tz=None), np.nan, np.nan])
Out[11]:
0 2014-01-03
1 NaT
2 NaT
dtype: datetime64[ns]
The value NaT means "Not A Time", the equivalent of nan for timestamp values.
Can you tell the dtypes of your data frame? Try casting the columns to float values.
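If the goal is simply to avoid that per-row cast, one workaround (a sketch, not from the original answer) is to iterate with itertuples, which reads values column by column and therefore keeps numeric missing values as nan:
# each tuple keeps the original per-column values; float NaNs stay nan, not NaT
for row in diffs.itertuples(index=False):
    print("err", list(row))
Casting the columns to object first (diffs.astype(object)) is another way to sidestep the datetime64 inference.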