summation of pandas timestamp and array containing timedelta values - python

I've a start date and an array containing irregular sample values in days that I would like to use as date index for pandas series.
Like:
In [233]: date = pd.Timestamp('2015-10-17 08:00:00'); date
Out[233]: Timestamp('2015-10-17 08:00:00')
In [234]: sample_size = np.array([0, 10, 13, 19, 30]); sample_size
Out[234]: array([ 0, 10, 13, 19, 30])
Now I could make use of a list and the following for loop to create the pandas datetime series:
In [235]: all_dates = []
          for stepsize in sample_size:
              days = pd.Timedelta(stepsize, 'D')
              all_dates.append(date + days)
          pd.Series(all_dates)
Out[235]:
0   2015-10-17 08:00:00
1   2015-10-27 08:00:00
2   2015-10-30 08:00:00
3   2015-11-05 08:00:00
4   2015-11-16 08:00:00
dtype: datetime64[ns]
But I was hoping for a purely numpy or pandas solution, without the need for a list and a for loop.

In [11]:
pd.Series(pd.TimedeltaIndex(sample_size, unit='D') + date)
Out[11]:
0 2015-10-17 08:00:00
1 2015-10-27 08:00:00
2 2015-10-30 08:00:00
3 2015-11-05 08:00:00
4 2015-11-16 08:00:00
dtype: datetime64[ns]
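An equivalent spelling (a sketch, not from the original answer) uses pd.to_timedelta, which accepts the NumPy array directly and broadcasts the scalar Timestamp over the result:

```python
import numpy as np
import pandas as pd

date = pd.Timestamp('2015-10-17 08:00:00')
sample_size = np.array([0, 10, 13, 19, 30])

# pd.to_timedelta converts the whole array in one call; adding the
# scalar Timestamp broadcasts over every element:
result = pd.Series(date + pd.to_timedelta(sample_size, unit='D'))
```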
First you need to create a timedelta of all the values you want to add to your date. Notice I've passed 'D' as the unit parameter, which tells pandas the values are in days, because we want to add days to our date:
In [42]:
time_delta = pd.TimedeltaIndex(sample_size, unit='D')
time_delta
Out[42]:
TimedeltaIndex(['0 days', '10 days', '13 days', '19 days', '30 days'], dtype='timedelta64[ns]', freq=None)
Then, in order to add your timedelta to your date, you need to fulfill two conditions. First, create a timeseries from your date so that you can add the timedelta to it; second, that newly created timeseries must have the same number of elements as your timedelta, which can be achieved with repeat(len(sample_size)):
In [40]:
time_stamp = pd.Series(np.array(date).repeat(len(sample_size)))
time_stamp
Out[40]:
0 2015-10-17 08:00:00
1 2015-10-17 08:00:00
2 2015-10-17 08:00:00
3 2015-10-17 08:00:00
4 2015-10-17 08:00:00
dtype: datetime64[ns]
In [41]:
time_stamp + time_delta
Out[41]:
0 2015-10-17 08:00:00
1 2015-10-27 08:00:00
2 2015-10-30 08:00:00
3 2015-11-05 08:00:00
4 2015-11-16 08:00:00
dtype: datetime64[ns]

Related

is there a way to efficiently fill a pandas df column in python with hourly datetimes between two dates?

So I am looking for a way to fill an empty dataframe column with hourly values between two dates.
for example between
StartDate = 2019:01:01 00:00:00
to
EndDate = 2019:02:01 00:00:00
I would want a column that has
2019:01:01 00:00:00,2019:01:01 01:00:00,2019:02:01 00:00:00...
in Y:M:D H:M:S format.
I am not sure what the most efficient way of doing this is, is there a way to do it via pandas or would you have to use a for loop over a given timedelta between a range for eg?
Use date_range with DataFrame constructor:
StartDate = '2019-01-01 00:00:00'
EndDate = '2019-02-01 00:00:00'
df = pd.DataFrame({'dates':pd.date_range(StartDate, EndDate, freq='H')})
If there is custom format of dates first convert them to datetimes:
StartDate = '2019:01:01 00:00:00'
EndDate = '2019:02:01 00:00:00'
StartDate = pd.to_datetime(StartDate, format='%Y:%m:%d %H:%M:%S')
EndDate = pd.to_datetime(EndDate, format='%Y:%m:%d %H:%M:%S')
df = pd.DataFrame({'dates':pd.date_range(StartDate, EndDate, freq='H')})
print (df.head(10))
dates
0 2019-01-01 00:00:00
1 2019-01-01 01:00:00
2 2019-01-01 02:00:00
3 2019-01-01 03:00:00
4 2019-01-01 04:00:00
5 2019-01-01 05:00:00
6 2019-01-01 06:00:00
7 2019-01-01 07:00:00
8 2019-01-01 08:00:00
9 2019-01-01 09:00:00
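If you know the desired row count rather than the end date, date_range also accepts a periods argument; a small sketch (the count 745 = 31 days * 24 hours + 1 for both endpoints inclusive):

```python
import pandas as pd

# Build the same hourly column by count instead of end date:
df = pd.DataFrame({'dates': pd.date_range('2019-01-01', periods=745, freq='H')})
```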

Group every value within an hour into one value which is its mean

I need a mean of all the values within that hour and I need to do it for all such hours for each day.
For Example:
Date Col1
2016-01-01 07:00:00 1
2016-01-01 07:05:00 2
2016-01-01 07:17:00 3
2016-01-01 08:13:00 2
2016-01-01 08:55:00 10
.
.
.
.
.
.
.
.
2016-12-31 22:00:00 3
2016-12-31 22:05:00 3
2016-12-31 23:13:00 4
2016-12-31 23:33:00 5
2016-12-31 23:53:00 6
So, I need to group all the values within each hour of each date into one value (its mean).
Expected Output:
Date Col1
2016-01-01 07:00:00 2 ## (2016-01-01 07:00:00, 07:05:00, 07:17:00) 3 values fall within the one-hour range for that date, i.e. 2016-01-01 07:00:00 - 2016-01-01 07:59:00, both inclusive.
2016-01-01 08:00:00 6
.
.
.
.
.
.
.
.
2016-12-31 22:00:00 3
2016-12-31 23:00:00 5
So, if I do it for whole year then in the end the total number of rows would be 365*24.
I tried solving using this answer but it doesn't work.
Can anyone help me?
resample from pandas should fit your case:
import pandas as pd

df = pd.DataFrame({
    'Date': ['2016-01-01 07:00:00', '2016-01-01 07:05:00',
             '2016-01-01 07:17:00', '2016-01-01 08:13:00',
             '2016-01-01 08:55:00', '2016-12-31 22:00:00',
             '2016-12-31 22:05:00', '2016-12-31 23:13:00',
             '2016-12-31 23:33:00', '2016-12-31 23:53:00'],
    'Col1': [1, 2, 3, 2, 10, 3, 3, 4, 5, 6]
})
df['Date'] = pd.to_datetime(df['Date'])  # convert the column to datetime type
df.set_index('Date', inplace=True)       # set the Date column as index
# for every hour, take the mean of the remaining columns of the dataframe
# (in this case only Col1), fill the NaN with 0 and reset the index
df = df.resample('H').mean().fillna(0).reset_index()
df.head()
Date Col1
0 2016-01-01 07:00:00 2.0
1 2016-01-01 08:00:00 6.0
2 2016-01-01 09:00:00 0.0
3 2016-01-01 10:00:00 0.0
4 2016-01-01 11:00:00 0.0
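Since resample spans the whole year, most hours contain no data; if you would rather drop those hours than zero-fill them, swap fillna(0) for dropna(). A sketch on a small toy frame (data shape mirrors the question, values are mine):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-01 07:00:00', '2016-01-01 07:05:00',
                            '2016-01-01 08:13:00', '2016-01-01 10:55:00']),
    'Col1': [1, 3, 2, 10],
}).set_index('Date')

# dropna() keeps only the hours that actually contained observations,
# so the empty 09:00 bucket disappears instead of becoming 0.0:
hourly = df.resample('H').mean().dropna().reset_index()
```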
Grouping by df['Date'].dt.hour alone would merge the same hour across different days; to keep each date's hours separate, floor the timestamps to the hour and group on that:
print(df.groupby(df['Date'].dt.floor('H'))['Col1'].mean().reset_index())
Output for first two rows:
Date Col1
0 2016-01-01 07:00:00 2.0
1 2016-01-01 08:00:00 6.0

Pandas Series of hour values to Series of dates

I have a time series covering January 1979 with 6-hour time deltas. The times are given as a continuous hour count:
1
7
13
18
25
31
.
.
.
739
Is it possible to convert these ints to dates? For instance:
1979/01/01 - 1:00
1979/01/01 - 7:00
1979/01/01 - 13:00
1979/01/01 - 18:00
1979/01/02 - 1:00
Thank you so much!
Setup
df = pd.DataFrame({'hour': [1,7,13,18,25,31]})
Use pd.to_datetime with the unit flag, and set the origin flag to the beginning of your desired year.
pd.to_datetime(df.hour, unit='h', origin='1979-01-01')
0 1979-01-01 01:00:00
1 1979-01-01 07:00:00
2 1979-01-01 13:00:00
3 1979-01-01 18:00:00
4 1979-01-02 01:00:00
5 1979-01-02 07:00:00
Name: hour, dtype: datetime64[ns]
Here is another way:
import pandas as pd
s = pd.Series([1,7,13])
s = pd.to_datetime(s*1e9*60*60+ pd.Timestamp(1979,1,1).value)
print(s)
Returns:
0 1979-01-01 01:00:00
1 1979-01-01 07:00:00
2 1979-01-01 13:00:00
dtype: datetime64[ns]
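The multiplication above works because Timestamp.value is nanoseconds since the epoch, the resolution of datetime64[ns]. A sketch naming the conversion factor explicitly (the constant name is mine):

```python
import pandas as pd

s = pd.Series([1, 7, 13])
NS_PER_HOUR = 60 * 60 * 10**9  # datetime64[ns] ticks in one hour
# hours -> nanoseconds, shifted by the epoch offset of 1979-01-01:
result = pd.to_datetime(s * NS_PER_HOUR + pd.Timestamp(1979, 1, 1).value)
```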
Could also just do this:
from datetime import datetime, timedelta
s = pd.Series([1,7,13,18,25])
s = s.apply(lambda h: datetime(1979, 1, 1) + timedelta(hours=h))
print(s)
Returns:
0 1979-01-01 01:00:00
1 1979-01-01 07:00:00
2 1979-01-01 13:00:00
3 1979-01-01 18:00:00
4 1979-01-02 01:00:00
dtype: datetime64[ns]
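The answers above can also be spelled without epoch arithmetic or apply, by turning the hour counts into a timedelta series and adding it to the year's start; a vectorized sketch (not from the original answers):

```python
import pandas as pd

s = pd.Series([1, 7, 13, 18, 25])
# Add the hour offsets as timedeltas to the start of 1979:
dates = pd.Timestamp('1979-01-01') + pd.to_timedelta(s, unit='h')
```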

Fastest way to get rolling averages in pandas?

I have a list of nodes (about 2300 of them) that have hourly price data for about a year. I have a script that, for each node, loops through the times of the day to create a 4-hour trailing average, then groups the averages by month and hour. Finally, these hours in a month are averaged to give, for each month, a typical day of prices. I'm wondering if there is a faster way to do this because what I have seems to take a significant amount of time (about an hour). I also save the dataframes as csv files for later visualization (that's not the slow part).
df (before anything is done to it)
Price_Node_Name Local_Datetime_HourEnding Price Irrelevant_column
0 My-node 2016-08-17 01:00:00 20.95 EST
1 My-node 2016-08-17 02:00:00 21.45 EST
2 My-node 2016-08-17 03:00:00 25.60 EST
df_node (after the groupby as it looks going to csv)
Month Hour MA
1 0 23.55
1 1 23.45
1 2 21.63
for node in node_names:
    df_node = df[df['Price_Node_Name'] == node]
    df_node['MA'] = df_node['Price'].rolling(4).mean()
    df_node = df_node.groupby([df_node['Local_Datetime_HourEnding'].dt.month,
                               df_node['Local_Datetime_HourEnding'].dt.hour]).mean()
    df_node.to_csv('%s_rollingavg.csv' % node)
I get a weak error warning me about SettingWithCopy, but I haven't quite figured out how to use .loc here, since the column ['MA'] doesn't exist until I create it in this snippet, and any way I can think of to create it beforehand and fill it seems slower than what I have. Could be totally wrong though. Any help would be great.
python 3.6
edit: I might have misread the question here, hopefully this at least sparks some ideas for the solution.
I think it is useful to have the index as the datetime column when working with time series data in Pandas.
Here is some sample data:
Out[3]:
price
date
2015-01-14 00:00:00 155.427361
2015-01-14 01:00:00 205.285202
2015-01-14 02:00:00 205.305021
2015-01-14 03:00:00 195.000000
2015-01-14 04:00:00 213.102000
2015-01-14 05:00:00 214.500000
2015-01-14 06:00:00 222.544375
2015-01-14 07:00:00 227.090251
2015-01-14 08:00:00 227.700000
2015-01-14 09:00:00 243.456190
We use Series.rolling to create an MA column, i.e. we apply the method to the price column, with a two-period window, and call mean on the resulting rolling object:
In [4]: df['MA'] = df.price.rolling(window=2).mean()
In [5]: df
Out[5]:
price MA
date
2015-01-14 00:00:00 155.427361 NaN
2015-01-14 01:00:00 205.285202 180.356281
2015-01-14 02:00:00 205.305021 205.295111
2015-01-14 03:00:00 195.000000 200.152510
2015-01-14 04:00:00 213.102000 204.051000
2015-01-14 05:00:00 214.500000 213.801000
2015-01-14 06:00:00 222.544375 218.522187
2015-01-14 07:00:00 227.090251 224.817313
2015-01-14 08:00:00 227.700000 227.395125
2015-01-14 09:00:00 243.456190 235.578095
And if you want month and hour columns, can extract those from the index:
In [7]: df['month'] = df.index.month
In [8]: df['hour'] = df.index.hour
In [9]: df
Out[9]:
price MA month hour
date
2015-01-14 00:00:00 155.427361 NaN 1 0
2015-01-14 01:00:00 205.285202 180.356281 1 1
2015-01-14 02:00:00 205.305021 205.295111 1 2
2015-01-14 03:00:00 195.000000 200.152510 1 3
2015-01-14 04:00:00 213.102000 204.051000 1 4
2015-01-14 05:00:00 214.500000 213.801000 1 5
2015-01-14 06:00:00 222.544375 218.522187 1 6
2015-01-14 07:00:00 227.090251 224.817313 1 7
2015-01-14 08:00:00 227.700000 227.395125 1 8
2015-01-14 09:00:00 243.456190 235.578095 1 9
Then we can use groupby:
In [11]: df.groupby([
...: df['month'],
...: df['hour']
...: ]).mean()[['MA']]
Out[11]:
MA
month hour
1 0 NaN
1 180.356281
2 205.295111
3 200.152510
4 204.051000
5 213.801000
6 218.522187
7 224.817313
8 227.395125
9 235.578095
Here's a few things to try:
set 'Price_Node_Name' as the index before the loop, then select each node's rows with .loc
df.set_index('Price_Node_Name', inplace=True)
for node in node_names:
    df_node = df.loc[node]
use sort=False as a kwarg in the groupby
df_node.groupby(..., sort=False).mean()
Perform the rolling average AFTER the groupby, or don't do it at all--I don't think you need it in your case. Averaging the hourly totals for a month will give you the expected values for a typical day, which is what you desire. If you still want the rolling average, perform it on the averaged hourly totals for each month.

Groupby for datetime on a scale of hours (ignoring what day)

I have a series of floats with a datetimeindex that I have resampled into bins of 3 hours. As such I have an index containing
2015-01-01 09:00:00
2015-01-01 12:00:00
2015-01-01 15:00:00
2015-01-01 18:00:00
2015-01-01 21:00:00
2015-01-02 00:00:00
2015-01-02 03:00:00
2015-01-02 06:00:00
2015-01-02 09:00:00
and so forth. I am trying to sum the floats associated with each time of day, say 09:00:00, for all days.
The only way I can think of, with my limited experience, is to convert this series to a dataframe using the datetime index as another column, then iterate to check whether the hour slots of the datetimes are equal and sum the matching values. I feel like this is horribly inefficient and probably not the 'correct' way to do this. Any help would be appreciated!
IIUC:
In [116]: s
Out[116]:
2015-01-01 09:00:00 3
2015-01-01 12:00:00 1
2015-01-01 15:00:00 0
2015-01-01 18:00:00 1
2015-01-01 21:00:00 0
2015-01-02 00:00:00 9
2015-01-02 03:00:00 2
2015-01-02 06:00:00 2
2015-01-02 09:00:00 7
2015-01-02 12:00:00 8
Freq: 3H, Name: val, dtype: int32
In [117]: s.groupby(s.index - s.index.normalize()).sum()
Out[117]:
00:00:00 9
03:00:00 2
06:00:00 2
09:00:00 10
12:00:00 9
15:00:00 0
18:00:00 1
21:00:00 0
Name: val, dtype: int32
or:
In [118]: s.groupby(s.index.hour).sum()
Out[118]:
0 9
3 2
6 2
9 10
12 9
15 0
18 1
21 0
Name: val, dtype: int32
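For reproducibility, the series used in the answer can be rebuilt like this (index and values copied from the transcript above); grouping by index.hour then sums same-hour values across days:

```python
import pandas as pd

idx = pd.date_range('2015-01-01 09:00', periods=10, freq='3H')
s = pd.Series([3, 1, 0, 1, 0, 9, 2, 2, 7, 8], index=idx, name='val')

# Same clock hour on different days falls into one bucket:
by_hour = s.groupby(s.index.hour).sum()
```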
