Pandas returns NaT when it should not - python

My DataFrame is
time NTCS001G002 NTCS001W005
0 2013-05-30 23:00:00 NaN NaN
1 2013-06-30 23:00:00 249 60
2 2013-07-31 23:00:00 161 2
3 2013-09-01 23:00:00 151 11
4 2013-09-04 23:00:00 14 0
5 2013-10-01 23:00:00 162 64
6 2013-11-01 00:00:00 281 175
7 2013-12-03 00:00:00 482 168
8 2014-01-02 00:00:00 378 NaN
9 2014-01-03 00:00:00 NaN NaN
10 2014-02-03 00:00:00 NaN 167
11 2014-03-03 00:00:00 502 167
When I iterate the rows like
for index, row in diffs.iterrows():
    print "err", row.tolist()
err [Timestamp('2013-05-30 23:00:00', tz=None), NaT, NaT]
err [Timestamp('2013-06-30 23:00:00', tz=None), 249.0, 60.0]
err [Timestamp('2013-07-31 23:00:00', tz=None), 161.0, 2.0]
err [Timestamp('2013-09-01 23:00:00', tz=None), 151.0, 11.0]
err [Timestamp('2013-09-04 23:00:00', tz=None), 14.0, 0.0]
err [Timestamp('2013-10-01 23:00:00', tz=None), 162.0, 64.0]
err [Timestamp('2013-11-01 00:00:00', tz=None), 281.0, 175.0]
err [Timestamp('2013-12-03 00:00:00', tz=None), 482.0, 168.0]
err [Timestamp('2014-01-02 00:00:00', tz=None), 378.0, nan]
err [Timestamp('2014-01-03 00:00:00', tz=None), NaT, NaT]
err [Timestamp('2014-02-03 00:00:00', tz=None), nan, 167.0]
err [Timestamp('2014-03-03 00:00:00', tz=None), 502.0, 167.0]
I am not sure whether those NaT values are a bug. I think they should be NaN.
Can pandas be made not to return NaT, and if not, how can I check for them? I will have to replace them in the list.
Thanks

The reason is that iterrows turns each row into a Series, and a row containing a Timestamp is upcast to datetime64, so the row's missing values become NaT:
In [11]: pd.Series([pd.Timestamp('2014-01-03 00:00:00', tz=None), np.nan, np.nan])
Out[11]:
0 2014-01-03
1 NaT
2 NaT
dtype: datetime64[ns]

The value NaT means "Not A Time", the equivalent of nan for timestamp values.
Can you show the dtypes of your DataFrame? Try casting the numeric columns to float.
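If you just need to detect and replace these values while iterating, a minimal sketch (pd.isnull is True for both nan and NaT) could look like this:
for index, row in diffs.iterrows():
    # replace any missing value (nan or NaT) with None
    values = [None if pd.isnull(v) else v for v in row.tolist()]
    print "err", values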

Related

Sum hourly values between 2 dates in pandas

I have a df like this:
DATE PP
0 2011-12-20 07:00:00 0.0
1 2011-12-20 08:00:00 0.0
2 2011-12-20 09:00:00 2.0
3 2011-12-20 10:00:00 0.0
4 2011-12-20 11:00:00 0.0
5 2011-12-20 12:00:00 0.0
6 2011-12-20 13:00:00 0.0
7 2011-12-20 14:00:00 5.0
8 2011-12-20 15:00:00 0.0
9 2011-12-20 16:00:00 0.0
10 2011-12-20 17:00:00 2.0
11 2011-12-20 18:00:00 0.0
12 2011-12-20 19:00:00 0.0
13 2011-12-20 20:00:00 1.0
14 2011-12-20 21:00:00 0.0
15 2011-12-20 22:00:00 0.0
16 2011-12-20 23:00:00 0.0
17 2011-12-21 00:00:00 0.0
18 2011-12-21 01:00:00 3.0
19 2011-12-21 02:00:00 0.0
20 2011-12-21 03:00:00 0.0
21 2011-12-21 04:00:00 0.0
22 2011-12-21 05:00:00 0.0
23 2011-12-21 06:00:00 5.0
24 2011-12-21 07:00:00 0.0
... .... ... ...
75609 2020-08-05 16:00:00 0.0
75610 2020-08-05 19:00:00 0.0
[75614 rows x 2 columns]
I want the cumulative values of the PP column between 2 specific hourly dates on different days: the sum from every 07:00:00 of one day to 07:00:00 of the next day. For example, I want the cumulative value of PP from 2011-12-20 07:00:00 to 2011-12-21 07:00:00:
Expected result:
DATE CUMULATIVE VALUES PP
0 2011-12-20 18
1 2011-12-21 5
2 2011-12-22 10
etc... etc... ...
I tried this:
df['DAY'] = df['DATE'].dt.strftime('%d')
cumulatives=pd.DataFrame(df.groupby(['DAY'])['PP'].sum())
But this only sums each entire calendar day, not from 07:00:00 of one day to 07:00:00 of the next.
Data:
{'DATE': ['2011-12-20 07:00:00', '2011-12-20 08:00:00', '2011-12-20 09:00:00',
'2011-12-20 10:00:00', '2011-12-20 11:00:00', '2011-12-20 12:00:00',
'2011-12-20 13:00:00', '2011-12-20 14:00:00', '2011-12-20 15:00:00',
'2011-12-20 16:00:00', '2011-12-20 17:00:00', '2011-12-20 18:00:00',
'2011-12-20 19:00:00', '2011-12-20 20:00:00', '2011-12-20 21:00:00',
'2011-12-20 22:00:00', '2011-12-20 23:00:00', '2011-12-21 00:00:00',
'2011-12-21 01:00:00', '2011-12-21 02:00:00', '2011-12-21 03:00:00',
'2011-12-21 04:00:00', '2011-12-21 05:00:00', '2011-12-21 06:00:00',
'2011-12-21 07:00:00', '2020-08-05 16:00:00', '2020-08-05 19:00:00'],
'PP': [0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 2.0, 0.0, 0.0, 1.0,
0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0]}
One way is to subtract 7 hours from the date, so that every timestamp from 07:00 of one day up to (but not including) 07:00 of the next falls on the same calendar date; then groupby.sum produces the desired output:
df['DATE'] = pd.to_datetime(df['DATE'])
out = df.groupby(df['DATE'].sub(pd.to_timedelta('7h')).dt.date)['PP'].sum().reset_index(name='SUM')
Output:
DATE SUM
0 2011-12-20 18.0
1 2011-12-21 0.0
2 2020-08-05 0.0
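An equivalent approach, assuming a pandas version where resample accepts an offset argument (1.1+), is to bin directly into 24-hour windows anchored at 07:00. Note that, unlike the date groupby above, resample also emits zero-filled bins for any gaps in the data:
df['DATE'] = pd.to_datetime(df['DATE'])
out = (df.set_index('DATE')['PP']
       .resample('24h', offset='7h')  # bins run 07:00 -> 07:00 of the next day
       .sum()
       .reset_index(name='SUM'))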

Pandas calculate result dataframe from a dataframe of multiple trades at same timestamp

I have a DataFrame containing trades with duplicated timestamps and buy and sell orders divided over several rows. In my example the total order amount is the sum over the same timestamp for that particular stock. I have created a simplified DataFrame to show what the data looks like.
I would like to end up with a DataFrame of results from the trades and a trade ID for each trade.
All trades are long positions, i.e. buy and try to sell at a higher price.
The ID column for the desired output df2 is answered in this thread Create ID column in a pandas dataframe
import pandas as pd
from datetime import datetime
import numpy as np
string_date =['2018-01-01 01:00:00',
'2018-01-01 01:00:00',
'2018-01-01 01:00:00',
'2018-01-01 01:00:00',
'2018-01-01 02:00:00',
'2018-01-01 03:00:00',
'2018-01-01 03:00:00',
'2018-01-01 03:00:00',
'2018-01-01 04:00:00',
'2018-01-01 04:00:00',
'2018-01-01 04:00:00',
'2018-01-01 07:00:00',
'2018-01-01 07:00:00',
'2018-01-01 07:00:00',
'2018-01-01 08:00:00',
'2018-01-01 08:00:00',
'2018-01-01 08:00:00',
'2018-02-01 12:00:00',
]
data ={'stock': ['A','A','A','A','B','A','A','A','C','C','C','B','B','B','C','C','C','B'],
'deal': ['buy', 'buy', 'buy','buy','buy','sell','sell','sell','buy','buy','buy','sell','sell','sell','sell','sell','sell','buy'],
'amount':[1,2,3,4,10,8,1,1,3,2,5,2,2,6,3,3,4,5],
'price':[10,10,10,10,2,20,20,20,3,3,3,1,1,1,2,2,2,11]}
df = pd.DataFrame(data, index=string_date)
df
Out[245]:
stock deal amount price
2018-01-01 01:00:00 A buy 1 10
2018-01-01 01:00:00 A buy 2 10
2018-01-01 01:00:00 A buy 3 10
2018-01-01 01:00:00 A buy 4 10
2018-01-01 02:00:00 B buy 10 2
2018-01-01 03:00:00 A sell 8 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 04:00:00 C buy 3 3
2018-01-01 04:00:00 C buy 2 3
2018-01-01 04:00:00 C buy 5 3
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 6 1
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 4 2
2018-02-01 12:00:00 B buy 5 11
One desired output:
string_date2 =['2018-01-01 01:00:00',
'2018-01-01 02:00:00',
'2018-01-01 03:00:00',
'2018-01-01 04:00:00',
'2018-01-01 07:00:00',
'2018-01-01 08:00:00',
'2018-02-01 12:00:00',
]
data2 ={'stock': ['A','B', 'A', 'C', 'B','C','B'],
'deal': ['buy', 'buy','sell','buy','sell','sell','buy'],
'amount':[10,10,10,10,10,10,5],
'price':[10,2,20,3,1,2,11],
'ID': ['1', '2','1','3','2','3','4']
}
df2 = pd.DataFrame(data2, index=string_date2)
df2
Out[226]:
stock deal amount price ID
2018-01-01 01:00:00 A buy 10 10 1
2018-01-01 02:00:00 B buy 10 2 2
2018-01-01 03:00:00 A sell 10 20 1
2018-01-01 04:00:00 C buy 10 3 3
2018-01-01 07:00:00 B sell 10 1 2
2018-01-01 08:00:00 C sell 10 2 3
2018-02-01 12:00:00 B buy 5 11 4
Any ideas?
This solution assumes a 'Long Only' portfolio where short sales are not allowed. Once a position is opened for a given stock, the transaction is assigned a new trade ID. Increasing the position in that stock results in the same trade ID, as well as any sell transactions reducing the size of the position (including the final sale where the position quantity is reduced to zero). A subsequent buy transaction in that same stock results in a new trade ID.
In order to maintain consistent trade identifiers with a growing log of transactions, I created a class TradeTracker to track and assign trade identifiers for each transaction.
import numpy as np
import pandas as pd
# Create sample dataframe.
dates = [
'2018-01-01 01:00:00',
'2018-01-01 01:01:00',
'2018-01-01 01:02:00',
'2018-01-01 01:03:00',
'2018-01-01 02:00:00',
'2018-01-01 03:00:00',
'2018-01-01 03:01:00',
'2018-01-01 03:03:00',
'2018-01-01 04:00:00',
'2018-01-01 04:01:00',
'2018-01-01 04:02:00',
'2018-01-01 07:00:00',
'2018-01-01 07:01:00',
'2018-01-01 07:02:00',
'2018-01-01 08:00:00',
'2018-01-01 08:01:00',
'2018-01-01 08:02:00',
'2018-02-01 12:00:00',
'2018-03-01 12:00:00',
]
data = {
'stock': ['A','A','A','A','B','A','A','A','C','C','C','B','B','B','C','C','C','B','A'],
'deal': ['buy', 'buy', 'buy', 'buy', 'buy', 'sell', 'sell', 'sell', 'buy', 'buy', 'buy',
'sell', 'sell', 'sell', 'sell', 'sell', 'sell', 'buy', 'buy'],
'amount': [1, 2, 3, 4, 10, 8, 1, 1, 3, 2, 5, 2, 2, 6, 3, 3, 4, 5, 10],
'price': [10, 10, 10, 10, 2, 20, 20, 20, 3, 3, 3, 1, 1, 1, 2, 2, 2, 11, 15]
}
df = pd.DataFrame(data, index=pd.to_datetime(dates))
>>> df
stock deal amount price
2018-01-01 01:00:00 A buy 1 10
2018-01-01 01:01:00 A buy 2 10
2018-01-01 01:02:00 A buy 3 10
2018-01-01 01:03:00 A buy 4 10
2018-01-01 02:00:00 B buy 10 2
2018-01-01 03:00:00 A sell 8 20
2018-01-01 03:01:00 A sell 1 20
2018-01-01 03:03:00 A sell 1 20
2018-01-01 04:00:00 C buy 3 3
2018-01-01 04:01:00 C buy 2 3
2018-01-01 04:02:00 C buy 5 3
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:01:00 B sell 2 1
2018-01-01 07:02:00 B sell 6 1
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:01:00 C sell 3 2
2018-01-01 08:02:00 C sell 4 2
2018-02-01 12:00:00 B buy 5 11
2018-03-01 12:00:00 A buy 10 15
# Add `position` column representing the cumulative buys and sells for a given stock.
df['position'] = (
    df
    .assign(temp_amount=np.where(df['deal'].eq('buy'), df['amount'], -df['amount']))
    .groupby(['stock'])['temp_amount']
    .cumsum()
)
# Create a class to track trade identifiers and instantiate it.
class TradeTracker():
    def __init__(self):
        self.trade_counter = 0
        self.trade_ids = {}

    def get_trade_id(self, stock, position):
        if position == 0:
            # Position fully closed: retire this stock's ID so the next buy gets a new one.
            trade_id = self.trade_ids.pop(stock)
        elif stock not in self.trade_ids:
            # First transaction of a new position: assign the next sequential ID.
            self.trade_counter += 1
            self.trade_ids[stock] = trade_id = self.trade_counter
        else:
            # Ongoing position: reuse its ID.
            trade_id = self.trade_ids[stock]
        return trade_id

trade_tracker = TradeTracker()

# Add a `trade_id` column using our custom class in a list comprehension.
df['trade_id'] = [trade_tracker.get_trade_id(stock, position)
                  for stock, position in df[['stock', 'position']].to_numpy()]
>>> df
stock deal amount price position trade_id
2018-01-01 01:00:00 A buy 1 10 1 1
2018-01-01 01:01:00 A buy 2 10 3 1
2018-01-01 01:02:00 A buy 3 10 6 1
2018-01-01 01:03:00 A buy 4 10 10 1
2018-01-01 02:00:00 B buy 10 2 10 2
2018-01-01 03:00:00 A sell 8 20 2 1
2018-01-01 03:01:00 A sell 1 20 1 1
2018-01-01 03:03:00 A sell 1 20 0 1
2018-01-01 04:00:00 C buy 3 3 3 3
2018-01-01 04:01:00 C buy 2 3 5 3
2018-01-01 04:02:00 C buy 5 3 10 3
2018-01-01 07:00:00 B sell 2 1 8 2
2018-01-01 07:01:00 B sell 2 1 6 2
2018-01-01 07:02:00 B sell 6 1 0 2
2018-01-01 08:00:00 C sell 3 2 7 3
2018-01-01 08:01:00 C sell 3 2 4 3
2018-01-01 08:02:00 C sell 4 2 0 3
2018-02-01 12:00:00 B buy 5 11 5 4
2018-03-01 12:00:00 A buy 10 15 10 5
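A list comprehension is used here rather than groupby/apply so that the stateful tracker sees the transactions strictly in chronological order; popping a stock's entry once its position reaches zero is what guarantees that a later buy in the same stock opens a fresh trade ID.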
Changed your string_date to this:
In [2295]: string_date =['2018-01-01 01:00:00',
...: '2018-01-01 01:00:00',
...: '2018-01-01 01:00:00',
...: '2018-01-01 01:00:00',
...: '2018-01-01 02:00:00',
...: '2018-01-01 03:00:00',
...: '2018-01-01 03:00:00',
...: '2018-01-01 03:00:00',
...: '2018-01-01 04:00:00',
...: '2018-01-01 04:00:00',
...: '2018-01-01 04:00:00',
...: '2018-01-01 07:00:00',
...: '2018-01-01 07:00:00',
...: '2018-01-01 07:00:00',
...: '2018-01-01 08:00:00',
...: '2018-01-01 08:00:00',
...: '2018-01-01 08:00:00',
...: '2018-02-01 12:00:00',
...: ]
...:
So df now is:
In [2297]: df
Out[2297]:
stock deal amount price
2018-01-01 01:00:00 A buy 1 10
2018-01-01 01:00:00 A buy 2 10
2018-01-01 01:00:00 A buy 3 10
2018-01-01 01:00:00 A buy 4 10
2018-01-01 02:00:00 B buy 10 2
2018-01-01 03:00:00 A sell 8 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 04:00:00 C buy 3 3
2018-01-01 04:00:00 C buy 2 3
2018-01-01 04:00:00 C buy 5 3
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 6 1
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 4 2
2018-02-01 12:00:00 B buy 5 11
You can use GroupBy.agg to collapse the duplicated timestamps, then derive the ID from the buy rows:
In [2302]: x = df.reset_index().groupby(['index', 'stock', 'deal'], as_index=False).agg({'amount': 'sum', 'price': 'max'}).set_index('index')
In [2303]: m = x['deal'] == 'buy'
In [2305]: x['ID'] = m.cumsum().where(m)
In [2307]: x['ID'] = x.groupby('stock')['ID'].ffill()
In [2308]: x
Out[2308]:
stock deal amount price ID
index
2018-01-01 01:00:00 A buy 10 10 1.0
2018-01-01 02:00:00 B buy 10 2 2.0
2018-01-01 03:00:00 A sell 10 20 1.0
2018-01-01 04:00:00 C buy 10 3 3.0
2018-01-01 07:00:00 B sell 10 1 2.0
2018-01-01 08:00:00 C sell 10 2 3.0
2018-02-01 12:00:00 B buy 5 11 4.0
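How the ID works: m flags the buy rows, so m.cumsum() numbers the buys sequentially, .where(m) keeps those numbers on the buys and leaves NaN on the sells, and the per-stock ffill then carries each buy's ID forward onto the sells that close it out.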

Python Pandas time difference from the start of every day

I've got the following data frame on pandas:
d = {'col_Date_Time': ['2020-08-01 00:00:00',
'2020-08-01 00:10:00',
'2020-08-01 00:15:00',
'2020-08-01 00:19:00',
'2020-08-01 01:19:00',
'2020-08-02 00:00:00',
'2020-08-02 00:15:00',
'2020-08-02 00:35:00',
'2020-08-02 01:35:00']}
df = pd.DataFrame(data=d)
df = pd.to_datetime(df.col_Date_Time)
I want to add another column that contains the number of minutes from the start of each day.
So, the result in this case would be:
NAN
10
15
19
79
NAN
15
35
95
import pandas as pd
import numpy as np
df = pd.DataFrame({'col_Date_Time': ['2020-08-01 00:00:00',
'2020-08-01 00:10:00',
'2020-08-01 00:15:00',
'2020-08-01 00:19:00',
'2020-08-01 01:23:00',
'2020-08-02 00:00:00',
'2020-08-02 00:15:00',
'2020-08-02 00:35:00',
'2020-08-02 06:31:00']})
df['col_Date_Time'] = pd.to_datetime(df.col_Date_Time)
df['start_day_time_stamp'] = list(map(lambda x: x.date(), df['col_Date_Time']))
df['mins_from_day_start'] = ((pd.to_datetime(df['col_Date_Time']) - pd.to_datetime(df['start_day_time_stamp'])).dt.total_seconds()) / 60
df
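A vectorized alternative, assuming col_Date_Time is already datetime64, normalizes the timestamps to midnight instead of mapping each one to a Python date object:
df['mins_from_day_start'] = (df['col_Date_Time'] - df['col_Date_Time'].dt.normalize()).dt.total_seconds() / 60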
Let us try computing the offset from midnight and masking the first entry of each day:
s = (df - df.dt.normalize()).dt.total_seconds().div(60).where(df.dt.date.duplicated())
Out[66]:
0     NaN
1    10.0
2    15.0
3    19.0
4    79.0
5     NaN
6    15.0
7    35.0
8    95.0
Name: col_Date_Time, dtype: float64
You can truncate the column to days (.dt.floor('d')), subtract it from col_Date_Time, and save the result in another column:
df["DELTA"] = df.col_Date_Time - df.col_Date_Time.dt.floor('d')
If you want it as an integer:
df["DELTA2"] = df.DELTA.dt.seconds.div(60).astype(int)
col_Date_Time DELTA DELTA2
0 2020-08-01 00:00:00 00:00:00 0
1 2020-08-01 00:10:00 00:10:00 10
2 2020-08-01 00:15:00 00:15:00 15
3 2020-08-01 00:19:00 00:19:00 19
4 2020-08-01 01:19:00 01:19:00 79
5 2020-08-02 00:00:00 00:00:00 0
6 2020-08-02 00:15:00 00:15:00 15
7 2020-08-02 00:35:00 00:35:00 35
8 2020-08-02 01:35:00 01:35:00 95
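Note that .dt.seconds is only the seconds component of each timedelta (0-86399), which works here because every delta is under a day; for deltas that can span days, use .dt.total_seconds() instead.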

Calculate difference of 2 dates in a pandas groupby object of the same 2 dates

I'm trying to create a new pandas.DataFrame column of the number of business days between two date columns. I'm unable to reference the dates in the date columns as arguments in a function call (I get a TypeError: Cannot convert input error). However, I'm able to zip the values of the series into a list and use a for loop to reference the parameters. Ideally, I would prefer to create a GroupBy object from the two date columns and calculate the difference.
Create DataFrame:
import pandas as pd
df = pd.DataFrame.from_dict({'Date1': ['2017-05-30 16:00:00',
'2017-05-30 16:00:00',
'2017-05-30 16:00:00'],
'Date2': ['2017-06-16 16:00:00',
'2017-07-21 16:00:00',
'2017-08-18 16:00:00'],
'Value1': [2.97, 3.3, 4.03],
'Value2': [96, 14, 2]})
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
df.dtypes
Validate DataFrame:
Date1 datetime64[ns]
Date2 datetime64[ns]
Value1 float64
Value2 int64
dtype: object
Define function:
def date_diff(startDate, endDate):
    return float(len(pd.bdate_range(startDate, endDate)) - 1)
Attempt to create a column from the result of the date_diff function call:
df['DateDiff'] = date_diff(df['Date1'], df['Date2'])
TypeError:
TypeError: Cannot convert input [0 2017-05-30 16:00:00
1 2017-05-30 16:00:00
2 2017-05-30 16:00:00
Name: Date1, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp
A "For Loop" referencing a list of tuples containing the dates works:
date_List = list(zip(df['Date1'], df['Date2']))
for i in range(len(date_List)):
    df.loc[(df['Date1'] == date_List[i][0]) & (df['Date2'] == date_List[i][1]), 'diff'] = date_diff(date_List[i][0], date_List[i][1])
Date1 Date2 Value1 Value2 diff
0 2017-05-30 16:00:00 2017-06-16 16:00:00 2.97 96 13.0
1 2017-05-30 16:00:00 2017-07-21 16:00:00 3.30 14 38.0
2 2017-05-30 16:00:00 2017-08-18 16:00:00 4.03 2 58.0
Ideally, I'd like to utilize a GroupBy object (by Date1 & Date2):
grp = df.groupby(['Date1', 'Date2'])
Desired Output:
[((Timestamp('2017-05-30 16:00:00'), Timestamp('2017-06-16 16:00:00')),
Date1 Date2 Value1 Value2 diff
0 2017-05-30 16:00:00 2017-06-16 16:00:00 2.97 96 13.0),
((Timestamp('2017-05-30 16:00:00'), Timestamp('2017-07-21 16:00:00')),
Date1 Date2 Value1 Value2 diff
1 2017-05-30 16:00:00 2017-07-21 16:00:00 3.3 14 38.0),
((Timestamp('2017-05-30 16:00:00'), Timestamp('2017-08-18 16:00:00')),
Date1 Date2 Value1 Value2 diff
2 2017-05-30 16:00:00 2017-08-18 16:00:00 4.03 2 58.0)]
You need to cast to datetime64[D] to make NumPy's busday_count happy:
Code:
import numpy as np
def date_diff(start_dates, end_dates):
    return np.busday_count(
        start_dates.values.astype('datetime64[D]'),
        end_dates.values.astype('datetime64[D]'))
Test Code:
import pandas as pd
df = pd.DataFrame.from_dict({'Date1': ['2017-05-30 16:00:00',
'2017-05-30 16:00:00',
'2017-05-30 16:00:00'],
'Date2': ['2017-06-16 16:00:00',
'2017-07-21 16:00:00',
'2017-08-18 16:00:00'],
'Value1': [2.97, 3.3, 4.03],
'Value2': [96, 14, 2]})
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
df['DateDiff'] = date_diff(df['Date1'], df['Date2'])
print(df)
Results:
Date1 Date2 Value1 Value2 DateDiff
0 2017-05-30 16:00:00 2017-06-16 16:00:00 2.97 96 13
1 2017-05-30 16:00:00 2017-07-21 16:00:00 3.30 14 38
2 2017-05-30 16:00:00 2017-08-18 16:00:00 4.03 2 58
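With DateDiff computed vectorially, the grouped view from the question is just the GroupBy object itself, for example:
grp = df.groupby(['Date1', 'Date2'])
list(grp)  # [(Timestamp-pair key, sub-DataFrame), ...] as in the desired output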

Adding columns pandas series, based on conditional time series

I have pulled some data from the internet which is basically 2 columns of hourly data for a whole year:
france.GetData(base_scenario, utils.enumerate_periods(start,end,'H','CET'))
output
2015-12-31 23:00:00+00:00 23.86
2016-01-01 00:00:00+00:00 22.39
2016-01-01 01:00:00+00:00 20.59
2016-01-01 02:00:00+00:00 16.81
2016-01-01 03:00:00+00:00 17.41
2016-01-01 04:00:00+00:00 17.02
2016-01-01 05:00:00+00:00 15.86...
I want to add two more columns, basically 'peak' and 'off peak' indicator columns. If the time of day is between 08:00 and 18:00 there will be a 1 in the peak column; outside these hours there will be a 1 in the off peak column.
Could anyone please explain how to do this.
Many thanks
I think you can use to_datetime if the index is not already a DatetimeIndex, then use between_time to build the peak column and test it with notnull - NaN gives False and any value gives True. The boolean values are then converted to int (False -> 0 and True -> 1) by astype, and last, peak-off is derived from the peak column (thanks Quickbeam2k1):
df = pd.DataFrame({'col': {'2016-01-01 01:00:00+00:00': 20.59, '2016-01-01 07:00:00+00:00': 15.86, '2016-01-01 10:00:00+00:00': 15.86, '2016-01-01 09:00:00+00:00': 15.86, '2016-01-01 02:00:00+00:00': 16.81, '2016-01-01 03:00:00+00:00': 17.41, '2016-01-01 05:00:00+00:00': 15.86, '2016-01-01 04:00:00+00:00': 17.02, '2016-01-01 08:00:00+00:00': 15.86, '2015-12-31 23:00:00+00:00': 23.86, '2016-01-01 18:00:00+00:00': 15.86, '2016-01-01 06:00:00+00:00': 15.86, '2016-01-01 00:00:00+00:00': 22.39}})
print (df)
col
2015-12-31 23:00:00+00:00 23.86
2016-01-01 00:00:00+00:00 22.39
2016-01-01 01:00:00+00:00 20.59
2016-01-01 02:00:00+00:00 16.81
2016-01-01 03:00:00+00:00 17.41
2016-01-01 04:00:00+00:00 17.02
2016-01-01 05:00:00+00:00 15.86
2016-01-01 06:00:00+00:00 15.86
2016-01-01 07:00:00+00:00 15.86
2016-01-01 08:00:00+00:00 15.86
2016-01-01 09:00:00+00:00 15.86
2016-01-01 10:00:00+00:00 15.86
2016-01-01 18:00:00+00:00 15.86
print (df.index)
Index(['2015-12-31 23:00:00+00:00', '2016-01-01 00:00:00+00:00',
'2016-01-01 01:00:00+00:00', '2016-01-01 02:00:00+00:00',
'2016-01-01 03:00:00+00:00', '2016-01-01 04:00:00+00:00',
'2016-01-01 05:00:00+00:00', '2016-01-01 06:00:00+00:00',
'2016-01-01 07:00:00+00:00', '2016-01-01 08:00:00+00:00',
'2016-01-01 09:00:00+00:00', '2016-01-01 10:00:00+00:00',
'2016-01-01 18:00:00+00:00'],
dtype='object')
df.index = pd.to_datetime(df.index)
print (df.index)
DatetimeIndex(['2015-12-31 23:00:00', '2016-01-01 00:00:00',
'2016-01-01 01:00:00', '2016-01-01 02:00:00',
'2016-01-01 03:00:00', '2016-01-01 04:00:00',
'2016-01-01 05:00:00', '2016-01-01 06:00:00',
'2016-01-01 07:00:00', '2016-01-01 08:00:00',
'2016-01-01 09:00:00', '2016-01-01 10:00:00',
'2016-01-01 18:00:00'],
dtype='datetime64[ns]', freq=None)
df['peak'] = df.between_time('08:00', '18:00')
df['peak'] = df['peak'].notnull().astype(int)
df['peak-off'] = -df['peak'] + 1
print (df)
col peak peak-off
2015-12-31 23:00:00 23.86 0 1
2016-01-01 00:00:00 22.39 0 1
2016-01-01 01:00:00 20.59 0 1
2016-01-01 02:00:00 16.81 0 1
2016-01-01 03:00:00 17.41 0 1
2016-01-01 04:00:00 17.02 0 1
2016-01-01 05:00:00 15.86 0 1
2016-01-01 06:00:00 15.86 0 1
2016-01-01 07:00:00 15.86 0 1
2016-01-01 08:00:00 15.86 1 0
2016-01-01 09:00:00 15.86 1 0
2016-01-01 10:00:00 15.86 1 0
2016-01-01 18:00:00 15.86 1 0
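Note that between_time('08:00', '18:00') includes both endpoints by default, which is why the 18:00 row counts as peak here.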
Another solution is to first build a boolean mask from the conditions and then convert it to int; to invert the mask, use ~:
from datetime import datetime  # pd.datetime has been removed from modern pandas

h1 = datetime.strptime('08:00:00', '%H:%M:%S').time()
h2 = datetime.strptime('18:00:00', '%H:%M:%S').time()
times = df.index.time
mask = (times >= h1) & (times <= h2)
df['peak'] = mask.astype(int)
df['peak-off'] = (~mask).astype(int)
print (df)
col peak peak-off
2015-12-31 23:00:00 23.86 0 1
2016-01-01 00:00:00 22.39 0 1
2016-01-01 01:00:00 20.59 0 1
2016-01-01 02:00:00 16.81 0 1
2016-01-01 03:00:00 17.41 0 1
2016-01-01 04:00:00 17.02 0 1
2016-01-01 05:00:00 15.86 0 1
2016-01-01 06:00:00 15.86 0 1
2016-01-01 07:00:00 15.86 0 1
2016-01-01 08:00:00 15.86 1 0
2016-01-01 09:00:00 15.86 1 0
2016-01-01 10:00:00 15.86 1 0
2016-01-01 18:00:00 15.86 1 0
If the data is hourly only, the solution can be simpler - use DatetimeIndex.hour to build the mask:
df.index = pd.to_datetime(df.index)
h = df.index.hour
mask = (h >= 8) & (h <= 18)
df['peak'] = mask.astype(int)
df['peak-off'] = (~mask).astype(int)
print (df)
col peak peak-off
2015-12-31 23:00:00 23.86 0 1
2016-01-01 00:00:00 22.39 0 1
2016-01-01 01:00:00 20.59 0 1
2016-01-01 02:00:00 16.81 0 1
2016-01-01 03:00:00 17.41 0 1
2016-01-01 04:00:00 17.02 0 1
2016-01-01 05:00:00 15.86 0 1
2016-01-01 06:00:00 15.86 0 1
2016-01-01 07:00:00 15.86 0 1
2016-01-01 08:00:00 15.86 1 0
2016-01-01 09:00:00 15.86 1 0
2016-01-01 10:00:00 15.86 1 0
2016-01-01 18:00:00 15.86 1 0
