Adding columns to a pandas series, based on a conditional time series - python

I have pulled some data from the internet which is basically 2 columns of hourly data for a whole year:
france.GetData(base_scenario, utils.enumerate_periods(start,end,'H','CET'))
output
2015-12-31 23:00:00+00:00 23.86
2016-01-01 00:00:00+00:00 22.39
2016-01-01 01:00:00+00:00 20.59
2016-01-01 02:00:00+00:00 16.81
2016-01-01 03:00:00+00:00 17.41
2016-01-01 04:00:00+00:00 17.02
2016-01-01 05:00:00+00:00 15.86...
I want to add two more columns: a 'peak' hour and an 'off peak' hour scalar column. If the time of day is between 08:00 and 18:00 there should be a 1 in the peak column, and if it is outside these hours there should be a 1 in the off peak column.
Could anyone please explain how to do this?
Many thanks

I think you can use to_datetime if the index is not already a DatetimeIndex, then use between_time to fill the peak column and test it with notnull - NaN gives False and any value gives True. The boolean values are then converted to int (False -> 0 and True -> 1) by astype, and last the peak-off column is derived from peak (thanks Quickbeam2k1):
import pandas as pd

df = pd.DataFrame({'col': {
    '2015-12-31 23:00:00+00:00': 23.86, '2016-01-01 00:00:00+00:00': 22.39,
    '2016-01-01 01:00:00+00:00': 20.59, '2016-01-01 02:00:00+00:00': 16.81,
    '2016-01-01 03:00:00+00:00': 17.41, '2016-01-01 04:00:00+00:00': 17.02,
    '2016-01-01 05:00:00+00:00': 15.86, '2016-01-01 06:00:00+00:00': 15.86,
    '2016-01-01 07:00:00+00:00': 15.86, '2016-01-01 08:00:00+00:00': 15.86,
    '2016-01-01 09:00:00+00:00': 15.86, '2016-01-01 10:00:00+00:00': 15.86,
    '2016-01-01 18:00:00+00:00': 15.86}})
print (df)
col
2015-12-31 23:00:00+00:00 23.86
2016-01-01 00:00:00+00:00 22.39
2016-01-01 01:00:00+00:00 20.59
2016-01-01 02:00:00+00:00 16.81
2016-01-01 03:00:00+00:00 17.41
2016-01-01 04:00:00+00:00 17.02
2016-01-01 05:00:00+00:00 15.86
2016-01-01 06:00:00+00:00 15.86
2016-01-01 07:00:00+00:00 15.86
2016-01-01 08:00:00+00:00 15.86
2016-01-01 09:00:00+00:00 15.86
2016-01-01 10:00:00+00:00 15.86
2016-01-01 18:00:00+00:00 15.86
print (df.index)
Index(['2015-12-31 23:00:00+00:00', '2016-01-01 00:00:00+00:00',
'2016-01-01 01:00:00+00:00', '2016-01-01 02:00:00+00:00',
'2016-01-01 03:00:00+00:00', '2016-01-01 04:00:00+00:00',
'2016-01-01 05:00:00+00:00', '2016-01-01 06:00:00+00:00',
'2016-01-01 07:00:00+00:00', '2016-01-01 08:00:00+00:00',
'2016-01-01 09:00:00+00:00', '2016-01-01 10:00:00+00:00',
'2016-01-01 18:00:00+00:00'],
dtype='object')
df.index = pd.to_datetime(df.index)
print (df.index)
DatetimeIndex(['2015-12-31 23:00:00', '2016-01-01 00:00:00',
'2016-01-01 01:00:00', '2016-01-01 02:00:00',
'2016-01-01 03:00:00', '2016-01-01 04:00:00',
'2016-01-01 05:00:00', '2016-01-01 06:00:00',
'2016-01-01 07:00:00', '2016-01-01 08:00:00',
'2016-01-01 09:00:00', '2016-01-01 10:00:00',
'2016-01-01 18:00:00'],
dtype='datetime64[ns]', freq=None)
df['peak'] = df.between_time('08:00', '18:00')
df['peak'] = df['peak'].notnull().astype(int)
df['peak-off'] = -df['peak'] + 1
print (df)
col peak peak-off
2015-12-31 23:00:00 23.86 0 1
2016-01-01 00:00:00 22.39 0 1
2016-01-01 01:00:00 20.59 0 1
2016-01-01 02:00:00 16.81 0 1
2016-01-01 03:00:00 17.41 0 1
2016-01-01 04:00:00 17.02 0 1
2016-01-01 05:00:00 15.86 0 1
2016-01-01 06:00:00 15.86 0 1
2016-01-01 07:00:00 15.86 0 1
2016-01-01 08:00:00 15.86 1 0
2016-01-01 09:00:00 15.86 1 0
2016-01-01 10:00:00 15.86 1 0
2016-01-01 18:00:00 15.86 1 0
Another solution is to first build a boolean mask from the conditions and then convert it to int; to invert the mask use ~:
from datetime import datetime
h1 = datetime.strptime('08:00:00', '%H:%M:%S').time()
h2 = datetime.strptime('18:00:00', '%H:%M:%S').time()
times = df.index.time
mask = (times >= h1) & (times <= h2)
df['peak'] = mask.astype(int)
df['peak-off'] = (~mask).astype(int)
print (df)
col peak peak-off
2015-12-31 23:00:00 23.86 0 1
2016-01-01 00:00:00 22.39 0 1
2016-01-01 01:00:00 20.59 0 1
2016-01-01 02:00:00 16.81 0 1
2016-01-01 03:00:00 17.41 0 1
2016-01-01 04:00:00 17.02 0 1
2016-01-01 05:00:00 15.86 0 1
2016-01-01 06:00:00 15.86 0 1
2016-01-01 07:00:00 15.86 0 1
2016-01-01 08:00:00 15.86 1 0
2016-01-01 09:00:00 15.86 1 0
2016-01-01 10:00:00 15.86 1 0
2016-01-01 18:00:00 15.86 1 0
If the data is hourly only, the solution can be simpler - use DatetimeIndex.hour for the mask:
df.index = pd.to_datetime(df.index)
print (df.index)
h = df.index.hour
mask = (h >= 8) & (h <= 18)
df['peak'] = mask.astype(int)
df['peak-off'] = (~mask).astype(int)
print (df)
col peak peak-off
2015-12-31 23:00:00 23.86 0 1
2016-01-01 00:00:00 22.39 0 1
2016-01-01 01:00:00 20.59 0 1
2016-01-01 02:00:00 16.81 0 1
2016-01-01 03:00:00 17.41 0 1
2016-01-01 04:00:00 17.02 0 1
2016-01-01 05:00:00 15.86 0 1
2016-01-01 06:00:00 15.86 0 1
2016-01-01 07:00:00 15.86 0 1
2016-01-01 08:00:00 15.86 1 0
2016-01-01 09:00:00 15.86 1 0
2016-01-01 10:00:00 15.86 1 0
2016-01-01 18:00:00 15.86 1 0
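For completeness, a compact variant of the hour-based mask with np.where - a minimal sketch, assuming numpy is imported and the same df as above:
import numpy as np
h = df.index.hour
# 1 when the hour falls in 8..18 (inclusive), else 0; off-peak is the complement
df['peak'] = np.where((h >= 8) & (h <= 18), 1, 0)
df['peak-off'] = 1 - df['peak']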

Related

Pandas calculate result dataframe from a dataframe of multiple trades at same timestamp

I have a dataframe containing trades with duplicated timestamps and buy and sell orders divided over several rows. In my example the total order amount is the sum over the same timestamp for that particular stock. I have created a simplified dataframe to show what the data looks like.
I would like to end up with a dataframe holding the results from the trades and a trading ID for each trade.
All trades are long positions, i.e. buy and try to sell at a higher price.
The ID column for the desired output df2 is answered in this thread: Create ID column in a pandas dataframe
import pandas as pd
from datetime import datetime
import numpy as np
string_date =['2018-01-01 01:00:00',
'2018-01-01 01:00:00',
'2018-01-01 01:00:00',
'2018-01-01 01:00:00',
'2018-01-01 02:00:00',
'2018-01-01 03:00:00',
'2018-01-01 03:00:00',
'2018-01-01 03:00:00',
'2018-01-01 04:00:00',
'2018-01-01 04:00:00',
'2018-01-01 04:00:00',
'2018-01-01 07:00:00',
'2018-01-01 07:00:00',
'2018-01-01 07:00:00',
'2018-01-01 08:00:00',
'2018-01-01 08:00:00',
'2018-01-01 08:00:00',
'2018-02-01 12:00:00',
]
data ={'stock': ['A','A','A','A','B','A','A','A','C','C','C','B','B','B','C','C','C','B'],
'deal': ['buy', 'buy', 'buy','buy','buy','sell','sell','sell','buy','buy','buy','sell','sell','sell','sell','sell','sell','buy'],
'amount':[1,2,3,4,10,8,1,1,3,2,5,2,2,6,3,3,4,5],
'price':[10,10,10,10,2,20,20,20,3,3,3,1,1,1,2,2,2,11]}
df = pd.DataFrame(data, index =string_date)
df
Out[245]:
stock deal amount price
2018-01-01 01:00:00 A buy 1 10
2018-01-01 01:00:00 A buy 2 10
2018-01-01 01:00:00 A buy 3 10
2018-01-01 01:00:00 A buy 4 10
2018-01-01 02:00:00 B buy 10 2
2018-01-01 03:00:00 A sell 8 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 04:00:00 C buy 3 3
2018-01-01 04:00:00 C buy 2 3
2018-01-01 04:00:00 C buy 5 3
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 6 1
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 4 2
2018-02-01 12:00:00 B buy 5 11
One desired output:
string_date2 =['2018-01-01 01:00:00',
'2018-01-01 02:00:00',
'2018-01-01 03:00:00',
'2018-01-01 04:00:00',
'2018-01-01 07:00:00',
'2018-01-01 08:00:00',
'2018-02-01 12:00:00',
]
data2 ={'stock': ['A','B', 'A', 'C', 'B','C','B'],
'deal': ['buy', 'buy','sell','buy','sell','sell','buy'],
'amount':[10,10,10,10,10,10,5],
'price':[10,2,20,3,1,2,11],
'ID': ['1', '2','1','3','2','3','4']
}
df2 = pd.DataFrame(data2, index =string_date2)
df2
Out[226]:
stock deal amount price ID
2018-01-01 01:00:00 A buy 10 10 1
2018-01-01 02:00:00 B buy 10 2 2
2018-01-01 03:00:00 A sell 10 20 1
2018-01-01 04:00:00 C buy 10 3 3
2018-01-01 07:00:00 B sell 10 1 2
2018-01-01 08:00:00 C sell 10 2 3
2018-02-01 12:00:00 B buy 5 11 4
Any ideas?
This solution assumes a 'Long Only' portfolio where short sales are not allowed. Once a position is opened for a given stock, the transaction is assigned a new trade ID. Increasing the position in that stock results in the same trade ID, as well as any sell transactions reducing the size of the position (including the final sale where the position quantity is reduced to zero). A subsequent buy transaction in that same stock results in a new trade ID.
In order to maintain consistent trade identifiers with a growing log of transactions, I created a class TradeTracker to track and assign trade identifiers for each transaction.
import numpy as np
import pandas as pd
# Create sample dataframe.
dates = [
'2018-01-01 01:00:00',
'2018-01-01 01:01:00',
'2018-01-01 01:02:00',
'2018-01-01 01:03:00',
'2018-01-01 02:00:00',
'2018-01-01 03:00:00',
'2018-01-01 03:01:00',
'2018-01-01 03:03:00',
'2018-01-01 04:00:00',
'2018-01-01 04:01:00',
'2018-01-01 04:02:00',
'2018-01-01 07:00:00',
'2018-01-01 07:01:00',
'2018-01-01 07:02:00',
'2018-01-01 08:00:00',
'2018-01-01 08:01:00',
'2018-01-01 08:02:00',
'2018-02-01 12:00:00',
'2018-03-01 12:00:00',
]
data = {
'stock': ['A','A','A','A','B','A','A','A','C','C','C','B','B','B','C','C','C','B','A'],
'deal': ['buy', 'buy', 'buy', 'buy', 'buy', 'sell', 'sell', 'sell', 'buy', 'buy', 'buy',
'sell', 'sell', 'sell', 'sell', 'sell', 'sell', 'buy', 'buy'],
'amount': [1, 2, 3, 4, 10, 8, 1, 1, 3, 2, 5, 2, 2, 6, 3, 3, 4, 5, 10],
'price': [10, 10, 10, 10, 2, 20, 20, 20, 3, 3, 3, 1, 1, 1, 2, 2, 2, 11, 15]
}
df = pd.DataFrame(data, index=pd.to_datetime(dates))
>>> df
stock deal amount price
2018-01-01 01:00:00 A buy 1 10
2018-01-01 01:01:00 A buy 2 10
2018-01-01 01:02:00 A buy 3 10
2018-01-01 01:03:00 A buy 4 10
2018-01-01 02:00:00 B buy 10 2
2018-01-01 03:00:00 A sell 8 20
2018-01-01 03:01:00 A sell 1 20
2018-01-01 03:03:00 A sell 1 20
2018-01-01 04:00:00 C buy 3 3
2018-01-01 04:01:00 C buy 2 3
2018-01-01 04:02:00 C buy 5 3
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:01:00 B sell 2 1
2018-01-01 07:02:00 B sell 6 1
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:01:00 C sell 3 2
2018-01-01 08:02:00 C sell 4 2
2018-02-01 12:00:00 B buy 5 11
2018-03-01 12:00:00 A buy 10 15
# Add `position` column representing the cumulative buys and sells for a given stock.
df['position'] = (
    df
    .assign(temp_amount=np.where(df['deal'].eq('buy'), df['amount'], -df['amount']))
    .groupby(['stock'])['temp_amount']
    .cumsum()
)
# Create a class to track trade identifiers and instantiate it.
class TradeTracker():
    def __init__(self):
        self.trade_counter = 0  # last trade ID handed out
        self.trade_ids = {}     # stock -> ID of its currently open trade

    def get_trade_id(self, stock, position):
        if position == 0:
            # Position fully closed: return its ID and forget the stock,
            # so a later buy in the same stock opens a new trade.
            trade_id = self.trade_ids.pop(stock)
        elif stock not in self.trade_ids:
            # First transaction for this stock: assign the next ID.
            self.trade_counter += 1
            self.trade_ids[stock] = trade_id = self.trade_counter
        else:
            # Open position: reuse its ID.
            trade_id = self.trade_ids[stock]
        return trade_id
trade_tracker = TradeTracker()
# Add a `trade_id` column using our custom class in a list comprehension.
df['trade_id'] = [trade_tracker.get_trade_id(stock, position)
                  for stock, position in df[['stock', 'position']].to_numpy()]
>>> df
stock deal amount price position trade_id
2018-01-01 01:00:00 A buy 1 10 1 1
2018-01-01 01:01:00 A buy 2 10 3 1
2018-01-01 01:02:00 A buy 3 10 6 1
2018-01-01 01:03:00 A buy 4 10 10 1
2018-01-01 02:00:00 B buy 10 2 10 2
2018-01-01 03:00:00 A sell 8 20 2 1
2018-01-01 03:01:00 A sell 1 20 1 1
2018-01-01 03:03:00 A sell 1 20 0 1
2018-01-01 04:00:00 C buy 3 3 3 3
2018-01-01 04:01:00 C buy 2 3 5 3
2018-01-01 04:02:00 C buy 5 3 10 3
2018-01-01 07:00:00 B sell 2 1 8 2
2018-01-01 07:01:00 B sell 2 1 6 2
2018-01-01 07:02:00 B sell 6 1 0 2
2018-01-01 08:00:00 C sell 3 2 7 3
2018-01-01 08:01:00 C sell 3 2 4 3
2018-01-01 08:02:00 C sell 4 2 0 3
2018-02-01 12:00:00 B buy 5 11 5 4
2018-03-01 12:00:00 A buy 10 15 10 5
I changed your string_date to this:
In [2295]: string_date =['2018-01-01 01:00:00',
...: '2018-01-01 01:00:00',
...: '2018-01-01 01:00:00',
...: '2018-01-01 01:00:00',
...: '2018-01-01 02:00:00',
...: '2018-01-01 03:00:00',
...: '2018-01-01 03:00:00',
...: '2018-01-01 03:00:00',
...: '2018-01-01 04:00:00',
...: '2018-01-01 04:00:00',
...: '2018-01-01 04:00:00',
...: '2018-01-01 07:00:00',
...: '2018-01-01 07:00:00',
...: '2018-01-01 07:00:00',
...: '2018-01-01 08:00:00',
...: '2018-01-01 08:00:00',
...: '2018-01-01 08:00:00',
...: '2018-02-01 12:00:00',
...: ]
...:
So df now is:
In [2297]: df
Out[2297]:
stock deal amount price
2018-01-01 01:00:00 A buy 1 10
2018-01-01 01:00:00 A buy 2 10
2018-01-01 01:00:00 A buy 3 10
2018-01-01 01:00:00 A buy 4 10
2018-01-01 02:00:00 B buy 10 2
2018-01-01 03:00:00 A sell 8 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 04:00:00 C buy 3 3
2018-01-01 04:00:00 C buy 2 3
2018-01-01 04:00:00 C buy 5 3
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 6 1
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 4 2
2018-02-01 12:00:00 B buy 5 11
You can use GroupBy.agg:
In [2302]: x = df.reset_index().groupby(['index', 'stock', 'deal'], as_index=False).agg({'amount': 'sum', 'price': 'max'}).set_index('index')
In [2303]: m = x['deal'] == 'buy'
In [2305]: x['ID'] = m.cumsum().where(m)
In [2307]: x['ID'] = x.groupby('stock')['ID'].ffill()
In [2308]: x
Out[2308]:
stock deal amount price ID
index
2018-01-01 01:00:00 A buy 10 10 1.0
2018-01-01 02:00:00 B buy 10 2 2.0
2018-01-01 03:00:00 A sell 10 20 1.0
2018-01-01 04:00:00 C buy 10 3 3.0
2018-01-01 07:00:00 B sell 10 1 2.0
2018-01-01 08:00:00 C sell 10 2 3.0
2018-02-01 12:00:00 B buy 5 11 4.0
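Collected outside the IPython session, the same steps as a plain sketch with brief comments:
x = (df.reset_index()
       .groupby(['index', 'stock', 'deal'], as_index=False)
       .agg({'amount': 'sum', 'price': 'max'})   # one row per timestamp/stock/deal
       .set_index('index'))
m = x['deal'] == 'buy'
x['ID'] = m.cumsum().where(m)                # number each opening buy
x['ID'] = x.groupby('stock')['ID'].ffill()   # carry that ID to the later sells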

How to divide 60 mins datapoints into 15 mins?

I have a dataset with a value at every 60-minute interval. Now, I want to divide it into 15-minute intervals using the averages between those two hourly values. How do I do that?
Time A
2016-01-01 00:00:00 1
2016-01-01 01:00:00 5
2016-01-01 02:00:00 13
So, I now want it to be in 15mins interval with average values:
Time A
2016-01-01 00:00:00 1
2016-01-01 00:15:00 2 ### at 2016-01-01 00:00:00 values is 1 and
2016-01-01 00:30:00 3 ### at 2016-01-01 01:00:00 values is 5.
2016-01-01 00:45:00 4 ### Therefore we have to fill 4 values ( 15 mins interval )
2016-01-01 01:00:00 5 ### with the average of the hour values.
2016-01-01 01:15:00 7
2016-01-01 01:30:00 9
2016-01-01 01:45:00 11
2016-01-01 02:00:00 13
I tried resampling it with mean to 15 mins, but that won't work (obviously) and it gives NaN values. Can anyone help me out on how to do it?
I would just resample: df.resample("15min").interpolate("linear")
Since you already have the Time column set as the index, it should work directly.
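A minimal runnable sketch of that one-liner, assuming the sample data above with Time as a DatetimeIndex:
import pandas as pd
df = pd.DataFrame(
    {'A': [1, 5, 13]},
    index=pd.to_datetime(['2016-01-01 00:00:00',
                          '2016-01-01 01:00:00',
                          '2016-01-01 02:00:00']))
df.index.name = 'Time'
# Upsample to 15-minute bins, then fill the new rows by linear interpolation.
print(df.resample('15min').interpolate('linear'))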
We can do this in one line with resample, replace and interpolate (assuming numpy is imported as np):
df.resample('15min').sum().replace(0, np.nan).interpolate()
Output
A
Time
2016-01-01 00:00:00 1.0
2016-01-01 00:15:00 2.0
2016-01-01 00:30:00 3.0
2016-01-01 00:45:00 4.0
2016-01-01 01:00:00 5.0
2016-01-01 01:15:00 7.0
2016-01-01 01:30:00 9.0
2016-01-01 01:45:00 11.0
2016-01-01 02:00:00 13.0
You can do that like this:
import pandas as pd
df = pd.DataFrame({
    'Time': ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 02:00:00"],
    'A': [1, 5, 13]
})
df['Time'] = pd.to_datetime(df['Time'])
new_idx = pd.date_range(start=df['Time'].iloc[0], end=df['Time'].iloc[-1], freq='15min')
df2 = df.set_index('Time').reindex(new_idx).interpolate().reset_index()
df2.rename(columns={'index': 'Time'}, inplace=True)
print(df2)
# Time A
# 0 2016-01-01 00:00:00 1.0
# 1 2016-01-01 00:15:00 2.0
# 2 2016-01-01 00:30:00 3.0
# 3 2016-01-01 00:45:00 4.0
# 4 2016-01-01 01:00:00 5.0
# 5 2016-01-01 01:15:00 7.0
# 6 2016-01-01 01:30:00 9.0
# 7 2016-01-01 01:45:00 11.0
# 8 2016-01-01 02:00:00 13.0
If you want column A in the result to be an integer you can add something like:
df2['A'] = df2['A'].round().astype(int)

Pandas - breaking the adding/subtracting of a cumsum() code in a pandas dataframe

I have a pandas df, and with df['Battery capacity'] = df['total_load'].cumsum() + 5200
I add the running values from "total_load" to the starting "Battery capacity" of 5200.
Now I would like to add something to my code that stops the adding/subtracting at a certain value. For example, I don't want any values higher than 5200. So let's say at 13:00:00 the adding up should stop at 5200.
How could I implement that in my code? Scott Boston proposed an if-statement, but how would you do that with my code, df['Battery capacity'] = df['total_load'].cumsum(if battery capacity = 5200, then stop adding) + 5200?
Should I try to write a function?
Output should be something like that:
time total_load battery capacity
2016-06-01 12:00:00 2150 4487.7
2016-06-01 13:00:00 1200 5688 (but should stop at 5200)
2016-06-01 14:00:00 1980 5200 (don't actually add values now because we are still at 5200)
You can use np.clip to clip upper and lower bounds.
df['Battery capacity'] = np.clip(df['total_load'].cumsum() + 5200,-np.inf,5200)
Or, as @jezrael points out, a pandas Series has a clip method:
df['Battery capacity'] = (df['total_load'].cumsum() + 5200).clip(-np.inf,5200)
Output:
Battery capacity total_load
2016-01-01 00:00:00 4755.0000 -445.0000
2016-01-01 01:00:00 4375.0000 -380.0000
2016-01-01 02:00:00 4025.0000 -350.0000
2016-01-01 03:00:00 3685.0000 -340.0000
2016-01-01 04:00:00 2955.4500 -729.5500
2016-01-01 05:00:00 1870.4500 -1085.0000
2016-01-01 06:00:00 879.1500 -991.3000
2016-01-01 07:00:00 -2555.8333 -3434.9833
2016-01-01 08:00:00 -1952.7503 603.0830
2016-01-01 09:00:00 -864.7503 1088.0000
2016-01-01 10:00:00 1155.2497 2020.0000
2016-01-01 11:00:00 2336.2497 1181.0000
2016-01-01 12:00:00 4486.2497 2150.0000
2016-01-01 13:00:00 5200.0000 1200.8330
2016-01-01 14:00:00 5200.0000 1980.0000
2016-01-01 15:00:00 5200.0000 -221.2667
Now, if you don't want the value to go below zero, replace -np.inf with 0:
Battery capacity total_load
2016-01-01 00:00:00 4755.0000 -445.0000
2016-01-01 01:00:00 4375.0000 -380.0000
2016-01-01 02:00:00 4025.0000 -350.0000
2016-01-01 03:00:00 3685.0000 -340.0000
2016-01-01 04:00:00 2955.4500 -729.5500
2016-01-01 05:00:00 1870.4500 -1085.0000
2016-01-01 06:00:00 879.1500 -991.3000
2016-01-01 07:00:00 0.0000 -3434.9833
2016-01-01 08:00:00 0.0000 603.0830
2016-01-01 09:00:00 0.0000 1088.0000
2016-01-01 10:00:00 1155.2497 2020.0000
2016-01-01 11:00:00 2336.2497 1181.0000
2016-01-01 12:00:00 4486.2497 2150.0000
2016-01-01 13:00:00 5200.0000 1200.8330
2016-01-01 14:00:00 5200.0000 1980.0000
2016-01-01 15:00:00 5200.0000 -221.2667
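Note that clipping the cumsum caps the reported value, but the underlying running total keeps accumulating past the bound, so a later discharge starts from the unclipped total. If the battery should truly saturate at 5200 and resume from there, one option is an explicit running loop - a sketch, with clamped_cumsum as a hypothetical helper name:
import pandas as pd
def clamped_cumsum(loads, start=5200, lower=0, upper=5200):
    # Saturate the running level at the bounds after every step.
    level = start
    out = []
    for load in loads:
        level = min(max(level + load, lower), upper)
        out.append(level)
    return pd.Series(out, index=loads.index)
# df['Battery capacity'] = clamped_cumsum(df['total_load'])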

How can I organize data hour-by-hour and set the missing values to zeros?

I played games several times a day and I got a score each time. I would like to reorganize the data hour-by-hour, and set the missing values to zero.
Here is the original data:
import pandas as pd
df = pd.DataFrame({
    'Time': ['2017-01-01 08:45:00', '2017-01-01 09:11:00',
             '2017-01-01 11:40:00', '2017-01-01 14:05:00',
             '2017-01-01 21:00:00'],
    'Score': range(1, 6)})
It looks like this:
Score Time
0 1 2017-01-01 08:45:00
1 2 2017-01-01 09:11:00
2 3 2017-01-01 11:40:00
3 4 2017-01-01 14:05:00
4 5 2017-01-01 21:00:00
How can I get a new dataframe like this:
day Hour Score
2017-01-01 00:00:00 0
...
2017-01-01 08:00:00 1
2017-01-01 09:00:00 2
2017-01-01 10:00:00 0
2017-01-01 11:00:00 3
2017-01-01 12:00:00 0
2017-01-01 13:00:00 0
2017-01-01 14:00:00 4
2017-01-01 15:00:00 0
2017-01-01 16:00:00 0
...
2017-01-01 21:00:00 5
...
2017-01-01 23:00:00 0
Many thanks!
You can use resample with an aggregate function like sum, then fillna and convert to int by astype, but first add the first and last DateTime values:
df.loc[-1, 'Time'] = '2017-01-01 00:00:00'
df.loc[-2, 'Time'] = '2017-01-01 23:00:00'
df['Time'] = pd.to_datetime(df['Time'])
df = df.resample('H', on='Time').sum().fillna(0).astype(int)
print (df)
Score
Time
2017-01-01 00:00:00 0
2017-01-01 01:00:00 0
2017-01-01 02:00:00 0
2017-01-01 03:00:00 0
2017-01-01 04:00:00 0
2017-01-01 05:00:00 0
2017-01-01 06:00:00 0
2017-01-01 07:00:00 0
2017-01-01 08:00:00 1
2017-01-01 09:00:00 2
2017-01-01 10:00:00 0
2017-01-01 11:00:00 3
2017-01-01 12:00:00 0
2017-01-01 13:00:00 0
2017-01-01 14:00:00 4
2017-01-01 15:00:00 0
2017-01-01 16:00:00 0
2017-01-01 17:00:00 0
2017-01-01 18:00:00 0
2017-01-01 19:00:00 0
2017-01-01 20:00:00 0
2017-01-01 21:00:00 5
2017-01-01 22:00:00 0
2017-01-01 23:00:00 0
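An alternative sketch that avoids appending helper rows: resample to hours, then reindex against the full day (the date_range bounds are assumed from the question):
df['Time'] = pd.to_datetime(df['Time'])
hours = pd.date_range('2017-01-01 00:00:00', '2017-01-01 23:00:00', freq='H')
out = (df.set_index('Time')['Score']
         .resample('H').sum()           # hourly totals; empty hours become 0
         .reindex(hours, fill_value=0)  # extend to cover the whole day
         .astype(int))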

how to get the shifted index value of a dataframe in Pandas?

Consider the simple example below:
date = pd.date_range('1/1/2011', periods=5, freq='H')
df = pd.DataFrame({'cat': ['A', 'A', 'A', 'B', 'B']}, index=date)
df
Out[278]:
cat
2011-01-01 00:00:00 A
2011-01-01 01:00:00 A
2011-01-01 02:00:00 A
2011-01-01 03:00:00 B
2011-01-01 04:00:00 B
I want to create a variable that contains the lagged/lead value of the index. That is something like:
df['index_shifted']=df.index.shift(1)
So, for instance, at time 2011-01-01 01:00:00 I expect the variable index_shifted to be 2011-01-01 00:00:00
How can I do that?
Thanks!
I think you need Index.shift with -1:
df['index_shifted']= df.index.shift(-1)
print (df)
cat index_shifted
2011-01-01 00:00:00 A 2010-12-31 23:00:00
2011-01-01 01:00:00 A 2011-01-01 00:00:00
2011-01-01 02:00:00 A 2011-01-01 01:00:00
2011-01-01 03:00:00 B 2011-01-01 02:00:00
2011-01-01 04:00:00 B 2011-01-01 03:00:00
For me it works without freq, but maybe it is necessary for real data:
df['index_shifted']= df.index.shift(-1, freq='H')
print (df)
cat index_shifted
2011-01-01 00:00:00 A 2010-12-31 23:00:00
2011-01-01 01:00:00 A 2011-01-01 00:00:00
2011-01-01 02:00:00 A 2011-01-01 01:00:00
2011-01-01 03:00:00 B 2011-01-01 02:00:00
2011-01-01 04:00:00 B 2011-01-01 03:00:00
EDIT:
If the freq of the DatetimeIndex is None, you need to add freq to shift:
import pandas as pd
date = pd.date_range('1/1/2011', periods=5, freq='H').union(pd.date_range('5/1/2011', periods=5, freq='H'))
df = pd.DataFrame({'cat': ['A', 'A', 'A', 'B', 'B',
                           'A', 'A', 'A', 'B', 'B']}, index=date)
print (df.index)
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 01:00:00',
'2011-01-01 02:00:00', '2011-01-01 03:00:00',
'2011-01-01 04:00:00', '2011-05-01 00:00:00',
'2011-05-01 01:00:00', '2011-05-01 02:00:00',
'2011-05-01 03:00:00', '2011-05-01 04:00:00'],
dtype='datetime64[ns]', freq=None)
df['index_shifted']= df.index.shift(-1, freq='H')
print (df)
cat index_shifted
2011-01-01 00:00:00 A 2010-12-31 23:00:00
2011-01-01 01:00:00 A 2011-01-01 00:00:00
2011-01-01 02:00:00 A 2011-01-01 01:00:00
2011-01-01 03:00:00 B 2011-01-01 02:00:00
2011-01-01 04:00:00 B 2011-01-01 03:00:00
2011-05-01 00:00:00 A 2011-04-30 23:00:00
2011-05-01 01:00:00 A 2011-05-01 00:00:00
2011-05-01 02:00:00 A 2011-05-01 01:00:00
2011-05-01 03:00:00 B 2011-05-01 02:00:00
2011-05-01 04:00:00 B 2011-05-01 03:00:00
What's wrong with df['index_shifted']=df.index.shift(-1)?
(Genuine question, not sure if I missed something)
This is an old question, but if your timestamps have gaps or you do not want to specify the frequency, and you are not dealing with timezones, the following will work:
df['index_shifted'] = pd.Series(df.index).shift(-1).values
If you are dealing with Timezones the following will work:
df['index_shifted'] = pd.to_datetime(pd.Series(df.index).shift(-1).values, utc=True).tz_convert('America/New_York')
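Note that the two approaches are not equivalent: Index.shift(-1, freq='H') moves every timestamp back one hour, while the positional Series.shift(-1) moves the neighbouring index values up one row. With gaps in the index they give different results - a small sketch:
import pandas as pd
idx = pd.DatetimeIndex(['2011-01-01 00:00', '2011-01-01 01:00',
                        '2011-01-01 04:00'])
print(idx.shift(-1, freq='H'))
# every timestamp minus one hour: 2010-12-31 23:00, 2011-01-01 00:00, 2011-01-01 03:00
print(pd.Series(idx).shift(-1).tolist())
# next row's value: 2011-01-01 01:00, 2011-01-01 04:00, NaT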
