I have data that I wish to group by week.
I have been able to do this using the following
Data_Frame.groupby([pd.Grouper(freq='W')]).count()
This creates a DataFrame of the form:
2018-01-07 ...
2018-01-14 ...
2018-01-21 ...
which is great. However, I need the bins to start at 06:00, so something like:
2018-01-07 06:00:00 ...
2018-01-14 06:00:00 ...
2018-01-21 06:00:00 ...
I am aware that I could shift my data by 6 hours, but this seems like a cheat, and I'm fairly sure Grouper comes with the functionality to do this (some way of specifying when it should start grouping).
I was hoping someone knows of a good method of doing this.
Many thanks
Edit:
I'm trying to use pandas' built-in functionality where possible, since it often works better and more consistently. I also turn the data into a graph with the timestamps as the y column, and I want the timestamps to actually reflect the data, without some method such as shifting everything by 6 hours, grouping it, and then reshifting everything back 6 hours to get the right timestamps.
Use a double shift:
import numpy as np
import pandas as pd

np.random.seed(456)
idx = pd.date_range(start = '2018-01-07', end = '2018-01-09', freq = '2H')
df = pd.DataFrame({'a':np.random.randint(10, size=25)}, index=idx)
print (df)
a
2018-01-07 00:00:00 5
2018-01-07 02:00:00 9
2018-01-07 04:00:00 4
2018-01-07 06:00:00 5
2018-01-07 08:00:00 7
2018-01-07 10:00:00 1
2018-01-07 12:00:00 8
2018-01-07 14:00:00 3
2018-01-07 16:00:00 5
2018-01-07 18:00:00 2
2018-01-07 20:00:00 4
2018-01-07 22:00:00 2
2018-01-08 00:00:00 2
2018-01-08 02:00:00 8
2018-01-08 04:00:00 4
2018-01-08 06:00:00 8
2018-01-08 08:00:00 5
2018-01-08 10:00:00 6
2018-01-08 12:00:00 0
2018-01-08 14:00:00 9
2018-01-08 16:00:00 8
2018-01-08 18:00:00 2
2018-01-08 20:00:00 3
2018-01-08 22:00:00 6
2018-01-09 00:00:00 7
#freq='D' for easy check, in original use `W`
df1 = df.shift(-6, freq='H').groupby([pd.Grouper(freq='D')]).count().shift(6, freq='H')
print (df1)
a
2018-01-06 06:00:00 3
2018-01-07 06:00:00 12
2018-01-08 06:00:00 10
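Applied to the weekly frequency from the question, the same pattern should look like the sketch below (reusing df from above; the bin labels then fall on Sunday 06:00):
# Same double-shift idea with the weekly frequency from the question.
df_w = df.shift(-6, freq='H').groupby([pd.Grouper(freq='W')]).count().shift(6, freq='H')
print (df_w)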
So to solve this, one needs to use the base parameter of Grouper.
However, the caveat is that base is expressed in whatever time unit is used for freq (years, months, days, etc.), from what I can tell.
So since I want to displace the starting position by 6 hours, my freq needs to be in hours rather than weeks (i.e. 1W = 168H).
So the solution I was looking for was
Data_Frame.groupby([pd.Grouper(freq='168H', base = 6)]).count()
This is simple, short, quick and works exactly as I want it to.
Thanks to all the other answers though
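Side note (not from the original answers): in newer pandas releases (1.1+, if I remember correctly) base is deprecated in favour of the offset/origin arguments of Grouper, so a sketch along the lines below should give the same 06:00-aligned bins. Treat it as an untested assumption about your pandas version.
import numpy as np
import pandas as pd

# Hedged sketch: assumes pandas >= 1.1, where pd.Grouper accepts `offset`
# (replacing the deprecated `base`). Small synthetic frame for illustration.
idx = pd.date_range('2018-01-07', periods=21 * 24, freq='H')
df = pd.DataFrame({'a': np.arange(len(idx))}, index=idx)

# Weekly-sized bins whose edges are shifted by 6 hours, e.g. 2018-01-07 06:00:00.
print (df.groupby(pd.Grouper(freq='168H', offset='6H')).count())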
I would create another column with the required dates and group by it.
import pandas as pd
import numpy as np
selected_datetime = pd.date_range(start = '2018-01-07', end = '2018-01-30', freq = '1H')
df = pd.DataFrame(selected_datetime, columns = ['date'])
df['value1'] = np.random.rand(df.shape[0])
# specify the condition for your date, eg. starting from 6am
df['shift1'] = df['date'].apply(lambda x: x.date() if x.hour == 6 else np.nan)
# forward fill the na values to have last date
df['shift1'] = df['shift1'].fillna(method = 'ffill')
# you can groupby on this col
df.groupby('shift1')['value1'].mean()
Related
I have a large df with a datetime index at an hourly time step and precipitation values in several columns. My precipitation values are a cumulative total during the day (from 1:00 am to 0:00 am of the next day) and are reset every day. Example:
datetime S1
2000-01-01 00:00:00 4.5 ...
2000-01-01 01:00:00 0 ...
2000-01-01 02:00:00 0 ...
2000-01-01 03:00:00 0 ...
2000-01-01 04:00:00 0
2000-01-01 05:00:00 0
2000-01-01 06:00:00 0
2000-01-01 07:00:00 0
2000-01-01 08:00:00 0
2000-01-01 09:00:00 0
2000-01-01 10:00:00 0
2000-01-01 11:00:00 6.5
2000-01-01 12:00:00 7.5
2000-01-01 13:00:00 8.7
2000-01-01 14:00:00 8.7
...
2000-01-01 22:00:00 8.7
2000-01-01 23:00:00 8.7
2000-01-02 00:00:00 8.7
2000-01-02 01:00:00 0
I am trying to go from this to the actual hourly values, so the value at 1:00 am of every day is fine, and for the other hours I want to subtract the value of the previous timestep.
Can I somehow use an if statement inside df.apply?
I thought of something like:
df_copy = df.copy()
df = df.apply(lambda x: if df.hour !=1: era5_T[x]=era5_T[x]-era5_T_copy[x-1])
But this is not working, presumably because I'm not calling a proper function? I could work with a for loop, but that doesn't seem like the most efficient way, as I'm working with a big dataset.
You can use numpy.where and pd.Series.shift to achieve the result:
import numpy as np
df['hourly_S1'] = np.where(df.index.hour == 1, df.S1, df.S1 - df.S1.shift())  # hour taken from the datetime index described in the question
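For illustration, a minimal self-contained sketch of that line (assuming the datetime index described in the question; the values are made up, not the OP's data):
import numpy as np
import pandas as pd

# Tiny cumulative series that resets at 01:00, mirroring the question's layout.
idx = pd.date_range('2000-01-01 00:00', periods=6, freq='H')
df = pd.DataFrame({'S1': [4.5, 0.0, 0.0, 1.2, 3.0, 3.0]}, index=idx)

# Keep the 01:00 value as the start of the daily accumulation,
# otherwise take the difference from the previous hour.
df['hourly_S1'] = np.where(df.index.hour == 1, df['S1'], df['S1'] - df['S1'].shift())
print (df)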
Background: In mplfinance, I want to be able to plot multiple trade markers in the same bar. Currently, to my understanding, you can add only 1 (or 1 buy and 1 sell) to the same bar. I cannot have 2 or more trades on the same side in the same bar unless I create another series.
Here is an example:
d = {'TradeDate': ['2018-10-15 06:00:00',
'2018-10-29 03:00:00',
'2018-10-29 03:00:00',
'2018-10-29 06:00:00',
'2018-11-15 05:00:00',
'2018-11-15 05:00:00',
'2018-11-15 05:00:00'],
'Price': [1.1596,
1.1433,
1.13926,
1.14015,
1.1413,
1.1400,
1.1403]}
df = pd.DataFrame(data=d)
df
TradeDate Price
0 2018-10-15 06:00:00 1.15960
1 2018-10-29 03:00:00 1.14330
2 2018-10-29 03:00:00 1.13926
3 2018-10-29 06:00:00 1.14015
4 2018-11-15 05:00:00 1.14130
5 2018-11-15 05:00:00 1.14000
6 2018-11-15 05:00:00 1.14030
As you can see, there are multiple trades for 2 of the datetimes. Now I would like to apply a rule that says: "If there is more than 1 trade (here: Price) per datetime, create a new column for the additional price; keep doing so until all prices for the same TradeDate (datetime) have been distributed across columns and all datetimes are unique." So the more prices there are for the same date, the more extra columns are needed.
The end result would look like this (I finagled this data manually):
TradeDate Price Price2 Price3
0 2018-10-15 06:00:00 1.15960 NaN NaN
1 2018-10-29 03:00:00 1.14330 1.13926 NaN
3 2018-10-29 06:00:00 1.14015 NaN NaN
4 2018-11-15 05:00:00 1.14130 1.14000 1.1403
The trick is to add an incremental counter to each unique datetime, such that if a datetime is encountered more than once, the counter increases.
To do this, we group by TradeDate and take the cumulative count of rows within each TradeDate. I then add 1 to this value so the counting starts at 1 instead of 0.
df["TradeDate_count"] = df.groupby("TradeDate").cumcount() + 1
print(df)
TradeDate Price TradeDate_count
0 2018-10-15 06:00:00 1.15960 1
1 2018-10-29 03:00:00 1.14330 1
2 2018-10-29 03:00:00 1.13926 2
3 2018-10-29 06:00:00 1.14015 1
4 2018-11-15 05:00:00 1.14130 1
5 2018-11-15 05:00:00 1.14000 2
6 2018-11-15 05:00:00 1.14030 3
Now that we've added that column, we can simply pivot to achieve your desired result. Note that I chained a rename(...) call simply to prefix the column names with "price". I also used rename_axis, because the pivot returns a named columns index, which some users find hard to look at, so I figured it would be best to remove it.
new_df = (df.pivot(index="TradeDate", columns="TradeDate_count", values="Price")
            .rename(columns="price{}".format)
            .rename_axis(columns=None))
price1 price2 price3
TradeDate
2018-10-15 06:00:00 1.15960 NaN NaN
2018-10-29 03:00:00 1.14330 1.13926 NaN
2018-10-29 06:00:00 1.14015 NaN NaN
2018-11-15 05:00:00 1.14130 1.14000 1.1403
A slightly different approach is to group the data by TradeDate and concatenate all the values into a list. The lists can then be spread out into separate columns and assigned to a new DataFrame.
reduced = df.groupby('TradeDate').agg(list)
new_df = pd.DataFrame(reduced['Price'].to_list(), index=reduced.index)
As per the other answer, if you wanted to rename for nicer comprehension you could do the following:
new_df.rename(columns=lambda x: f'Price{x if x > 0 else ""}', inplace=True)
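Put together as a runnable sketch, using the sample frame from the question (the rename is the one above, so the columns come out as Price, Price1, Price2):
import pandas as pd

# Sample data from the question.
d = {'TradeDate': ['2018-10-15 06:00:00', '2018-10-29 03:00:00', '2018-10-29 03:00:00',
                   '2018-10-29 06:00:00', '2018-11-15 05:00:00', '2018-11-15 05:00:00',
                   '2018-11-15 05:00:00'],
     'Price': [1.1596, 1.1433, 1.13926, 1.14015, 1.1413, 1.1400, 1.1403]}
df = pd.DataFrame(data=d)

# Collect all prices per TradeDate into a list, then spread each list across columns.
reduced = df.groupby('TradeDate').agg(list)
new_df = pd.DataFrame(reduced['Price'].to_list(), index=reduced.index)
new_df.rename(columns=lambda x: f'Price{x if x > 0 else ""}', inplace=True)
print (new_df)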
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
The Hora_Retiro column is of timedelta64[ns] type.
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index starts at 00:00:02, and I want it to start at 00:00:00 and then proceed in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column.
df['hour'] = df['Hora_Retiro'].dt.components.hours  # Hora_Retiro is timedelta64, so the hour-of-day is in .dt.components.hours (.dt.hour only exists for datetimes)
And then group by the hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
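A minimal self-contained sketch of that approach, assuming a timedelta64[ns] column as in the question (tiny synthetic data, not the OP's):
import pandas as pd

# Tiny synthetic frame with a timedelta64[ns] column, mirroring the question's layout.
df = pd.DataFrame({'Hora_Retiro': pd.to_timedelta(['00:00:18', '00:45:00', '01:10:00', '23:59:56']),
                   'count_uses': [1, 1, 1, 1]})

# For timedeltas, the hour-of-day lives in .dt.components.hours (0-23).
df['hour'] = df['Hora_Retiro'].dt.components.hours
print (df.groupby('hour')['count_uses'].sum())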
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type. It is not datetime, since in that case the date part would also be printed.
Indeed, your code creates groups starting at the minute/second taken from the first row.
To group by "full hours":
floor each element of this column to the hour,
then group (just by this floored value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.apply(
    lambda tt: tt.floor('H'))).count_uses.count()
However, I advise you to decide what you actually want to count: rows, or the values in the count_uses column.
In the second case, replace the count function with sum.
I am working on time series in Python 3 and pandas, and I want to produce a synthesis of the periods of contiguous missing values, but so far I'm only able to find the indexes of the NaN values...
Sample data :
Valeurs
2018-01-01 00:00:00 1.0
2018-01-01 04:00:00 NaN
2018-01-01 08:00:00 2.0
2018-01-01 12:00:00 NaN
2018-01-01 16:00:00 NaN
2018-01-01 20:00:00 5.0
2018-01-02 00:00:00 6.0
2018-01-02 04:00:00 7.0
2018-01-02 08:00:00 8.0
2018-01-02 12:00:00 9.0
2018-01-02 16:00:00 5.0
2018-01-02 20:00:00 NaN
2018-01-03 00:00:00 NaN
2018-01-03 04:00:00 NaN
2018-01-03 08:00:00 1.0
2018-01-03 12:00:00 2.0
2018-01-03 16:00:00 NaN
Expected results :
Start_Date number of contiguous missing values
2018-01-01 04:00:00 1
2018-01-01 12:00:00 2
2018-01-02 20:00:00 3
2018-01-03 16:00:00 1
How can I obtain this type of result with pandas (shift(), cumsum(), groupby()...)?
Thank you for your advice!
Sylvain
Use groupby and agg: label each contiguous run of NaNs with the cumulative sum of the non-NaN mask, then take each run's first timestamp and its size.
mask = df.Valeurs.isna()
d = df.index.to_series()[mask].groupby((~mask).cumsum()[mask]).agg(['first', 'size'])
d.rename(columns=dict(size='num of contig null', first='Start_Date')).reset_index(drop=True)
Start_Date num of contig null
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
Working on the underlying numpy array:
a = df.Valeurs.values
m = np.concatenate(([False],np.isnan(a),[False]))
idx = np.nonzero(m[1:] != m[:-1])[0]
out = df[df.Valeurs.isnull() & ~df.Valeurs.shift().isnull()].index
pd.DataFrame({'Start date': out, 'contiguous': (idx[1::2] - idx[::2])})
Start date contiguous
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
If you have the indices where the values occur, you can also use itertools, as in this approach, to find the contiguous chunks.
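For instance, a hedged sketch of that idea with itertools.groupby over the NaN mask (the series here is illustrative; replace it with df['Valeurs']):
from itertools import groupby

import numpy as np
import pandas as pd

# Illustrative series with NaN runs.
s = pd.Series([1.0, np.nan, 2.0, np.nan, np.nan, 5.0],
              index=pd.date_range('2018-01-01', periods=6, freq='4H'))

runs = []
pos = 0
for is_na, grp in groupby(s.isna()):
    length = len(list(grp))
    if is_na:
        runs.append((s.index[pos], length))  # (start timestamp, run length)
    pos += length

print (pd.DataFrame(runs, columns=['Start_Date', 'number of contiguous missing values']))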
I have a list of about 2,300 nodes that have hourly price data for about a year. I have a script that, for each node, loops through the times of day to create a 4-hour trailing average, then groups the averages by month and hour. Finally, the hours in each month are averaged to give, for each month, a typical day of prices. I'm wondering if there is a faster way to do this, because what I have takes a significant amount of time (about an hour). I also save the dataframes as csv files for later visualization (that's not the slow part).
df (before anything is done to it)
Price_Node_Name Local_Datetime_HourEnding Price Irrelevant_column
0 My-node 2016-08-17 01:00:00 20.95 EST
1 My-node 2016-08-17 02:00:00 21.45 EST
2 My-node 2016-08-17 03:00:00 25.60 EST
df_node (after the groupby as it looks going to csv)
Month Hour MA
1 0 23.55
1 1 23.45
1 2 21.63
for node in node_names:
    df_node = df[df['Price_Node_Name'] == node]
    df_node['MA'] = df_node['Price'].rolling(4).mean()
    df_node = df_node.groupby([df_node['Local_Datetime_HourEnding'].dt.month,
                               df_node['Local_Datetime_HourEnding'].dt.hour]).mean()
    df_node.to_csv('%s_rollingavg.csv' % node)
I get a SettingWithCopyWarning, but I haven't quite figured out how to use .loc here, since the 'MA' column doesn't exist until I create it in this snippet, and any way I can think of to create it beforehand and fill it seems slower than what I have. Could be totally wrong though. Any help would be great.
python 3.6
edit: I might have misread the question here, hopefully this at least sparks some ideas for the solution.
I think it is useful to have the index as the datetime column when working with time series data in Pandas.
Here is some sample data:
Out[3]:
price
date
2015-01-14 00:00:00 155.427361
2015-01-14 01:00:00 205.285202
2015-01-14 02:00:00 205.305021
2015-01-14 03:00:00 195.000000
2015-01-14 04:00:00 213.102000
2015-01-14 05:00:00 214.500000
2015-01-14 06:00:00 222.544375
2015-01-14 07:00:00 227.090251
2015-01-14 08:00:00 227.700000
2015-01-14 09:00:00 243.456190
We use Series.rolling to create an MA column, i.e. we apply the method to the price column, with a two-period window, and call mean on the resulting rolling object:
In [4]: df['MA'] = df.price.rolling(window=2).mean()
In [5]: df
Out[5]:
price MA
date
2015-01-14 00:00:00 155.427361 NaN
2015-01-14 01:00:00 205.285202 180.356281
2015-01-14 02:00:00 205.305021 205.295111
2015-01-14 03:00:00 195.000000 200.152510
2015-01-14 04:00:00 213.102000 204.051000
2015-01-14 05:00:00 214.500000 213.801000
2015-01-14 06:00:00 222.544375 218.522187
2015-01-14 07:00:00 227.090251 224.817313
2015-01-14 08:00:00 227.700000 227.395125
2015-01-14 09:00:00 243.456190 235.578095
And if you want month and hour columns, you can extract those from the index:
In [7]: df['month'] = df.index.month
In [8]: df['hour'] = df.index.hour
In [9]: df
Out[9]:
price MA month hour
date
2015-01-14 00:00:00 155.427361 NaN 1 0
2015-01-14 01:00:00 205.285202 180.356281 1 1
2015-01-14 02:00:00 205.305021 205.295111 1 2
2015-01-14 03:00:00 195.000000 200.152510 1 3
2015-01-14 04:00:00 213.102000 204.051000 1 4
2015-01-14 05:00:00 214.500000 213.801000 1 5
2015-01-14 06:00:00 222.544375 218.522187 1 6
2015-01-14 07:00:00 227.090251 224.817313 1 7
2015-01-14 08:00:00 227.700000 227.395125 1 8
2015-01-14 09:00:00 243.456190 235.578095 1 9
Then we can use groupby:
In [11]: df.groupby([
...: df['month'],
...: df['hour']
...: ]).mean()[['MA']]
Out[11]:
MA
month hour
1 0 NaN
1 180.356281
2 205.295111
3 200.152510
4 204.051000
5 213.801000
6 218.522187
7 224.817313
8 227.395125
9 235.578095
Here are a few things to try:
Set 'Price_Node_Name' as the index before the loop:
df.set_index('Price_Node_Name', inplace=True)
for node in node_names:
    df_node = df.loc[node]
use sort=False as a kwarg in the groupby
df_node.groupby(..., sort=False).mean()
Perform the rolling average AFTER the groupby, or don't do it at all; I don't think you need it in your case. Averaging the hourly values for a month will give you the expected values for a typical day, which is what you want. If you still want the rolling average, apply it to the averaged hourly values for each month. A rough sketch pulling these suggestions together follows.
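As a rough illustration only (not a drop-in replacement: it skips the 4-hour rolling average per the last point, and the column names are taken from the question), the suggestions could be combined into a single grouped pass:
import pandas as pd

# Hedged sketch: df is assumed to look like the frame in the question.
df['Local_Datetime_HourEnding'] = pd.to_datetime(df['Local_Datetime_HourEnding'])

month = df['Local_Datetime_HourEnding'].dt.month.rename('Month')
hour = df['Local_Datetime_HourEnding'].dt.hour.rename('Hour')

# One groupby over all nodes at once, instead of filtering the full frame per node.
typical_day = df.groupby(['Price_Node_Name', month, hour], sort=False)['Price'].mean()

# Write one csv per node, as in the original script (a plain mean here, since the rolling step is skipped).
for node, node_avg in typical_day.groupby(level='Price_Node_Name'):
    node_avg.droplevel('Price_Node_Name').rename('MA').to_csv('%s_rollingavg.csv' % node)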