Pandas grouping by start of the month with pd.Grouper - python

I have a DataFrame with hourly timestamps:
2019-01-01 0:00:00 1
2019-01-01 1:00:00 2
2019-01-11 3:00:00 1
2019-01-21 4:00:00 2
2019-02-01 0:00:00 1
2019-03-05 1:00:00 2
2019-03-21 3:00:00 1
2019-04-08 4:00:00 2
I am using the Pandas Grouper to group and sum the data monthly:
monthly_data = df.groupby(pd.Grouper(freq='M', label='left')).sum()
Expected output:
2019-01-01 0:00:00 6
2019-02-01 0:00:00 1
2019-03-01 0:00:00 3
2019-04-01 0:00:00 2
Actual output:
2018-12-31 0:00:00 6
2019-01-31 0:00:00 1
2019-02-28 0:00:00 3
2019-03-31 0:00:00 2
How can I get the labels of the groups to be the first element in the group?
Thank you

Use the frequency alias MS (month start) rather than M (month end).
See the DateOffset objects section of the pandas docs.

Use resample to aggregate on a DatetimeIndex:
df.resample('MS').sum()
value
date
2019-01-01 6
2019-02-01 1
2019-03-01 3
2019-04-01 2
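
As a minimal sketch of the fix, assuming the question's values sit in a column named value (as in the output above) on a DatetimeIndex, both pd.Grouper and resample label each group with the first day of the month once the alias is MS:

```python
import pandas as pd

# Sample data copied from the question; the column name "value" is an assumption.
stamps = pd.to_datetime([
    "2019-01-01 00:00", "2019-01-01 01:00", "2019-01-11 03:00", "2019-01-21 04:00",
    "2019-02-01 00:00", "2019-03-05 01:00", "2019-03-21 03:00", "2019-04-08 04:00",
])
df = pd.DataFrame({"value": [1, 2, 1, 2, 1, 2, 1, 2]}, index=stamps)

# "MS" anchors each bin, and therefore its label, at the start of the month.
by_grouper = df.groupby(pd.Grouper(freq="MS")).sum()
by_resample = df.resample("MS").sum()

print(by_grouper)
```

pd.Grouper is mostly useful when the monthly grouping has to be combined with other keys in the same groupby; for a single time-based grouping, resample is the shorter spelling.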


Pandas : Dataframe Output System Down Time

I am a beginner in Python. These readings come from sensors that report to the system at 20-minute intervals. I would like to find the total downtime of each outage, from the time it starts until the time the sensor recovers.
Original Data:
date, Quality Sensor Reading
1/1/2022 9:00 0
1/1/2022 9:20 0
1/1/2022 9:40 0
1/1/2022 10:00 0
1/1/2022 10:20 0
1/1/2022 10:40 0
1/1/2022 12:40 0
1/1/2022 13:00 0
1/1/2022 13:20 0
1/3/2022 1:20 0
1/3/2022 1:40 0
1/3/2022 2:00 0
1/4/2022 14:40 0
1/4/2022 15:00 0
1/4/2022 15:20 0
1/4/2022 17:20 0
1/4/2022 17:40 0
1/4/2022 18:00 0
1/4/2022 18:20 0
1/4/2022 18:40 0
The expected output is as below:
Quality Sensor = 0
Start_Time End_Time Total_Down_Time
2022-01-01 09:00:00 2022-01-01 10:40:00 100 minutes
2022-01-01 12:40:00 2022-01-01 13:20:00 40 minutes
2022-01-03 01:20:00 2022-01-03 02:00:00 40 minutes
2022-01-04 14:40:00 2022-01-04 15:20:00 40 minutes
2022-01-04 17:20:00 2022-01-04 18:40:00 80 minutes
First, let's break them into groups:
df.loc[df.date.diff().gt('00:20:00'), 'group'] = 1
df.group = df.group.cumsum().ffill().fillna(0)
Then, we can extract what we want from each group, and rename:
df2 = df.groupby('group')['date'].agg(['min', 'max']).reset_index(drop=True)
df2.columns = ['start_time', 'end_time']
Finally, we'll add the interval column and format it to minutes:
df2['down_time'] = df2.end_time.sub(df2.start_time)
# Optional, I wouldn't do this here:
# total_seconds() also counts whole days; .dt.seconds would drop them
df2.down_time = df2.down_time.dt.total_seconds() / 60
Output:
start_time end_time down_time
0 2022-01-01 09:00:00 2022-01-01 10:40:00 100.0
1 2022-01-01 12:40:00 2022-01-01 13:20:00 40.0
2 2022-01-03 01:20:00 2022-01-03 02:00:00 40.0
3 2022-01-04 14:40:00 2022-01-04 15:20:00 40.0
4 2022-01-04 17:20:00 2022-01-04 18:40:00 80.0
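
The three steps above can be put together in one self-contained sketch. It uses a shortened copy of the question's timestamps, and the names gap, out and down_minutes are made up for illustration. A new group starts whenever the distance to the previous reading exceeds 20 minutes, and the cumulative sum of those flags serves as the group label:

```python
import pandas as pd

# A shortened copy of the readings from the question.
stamps = [
    "1/1/2022 9:00", "1/1/2022 9:20", "1/1/2022 9:40",
    "1/1/2022 12:40", "1/1/2022 13:00",
    "1/3/2022 1:20",
]
df = pd.DataFrame({"date": pd.to_datetime(stamps, format="%m/%d/%Y %H:%M")})

# True wherever the gap to the previous reading is longer than the 20-minute
# reporting interval; the first row has no predecessor (NaT) and compares False.
gap = df["date"].diff() > pd.Timedelta(minutes=20)
df["group"] = gap.cumsum()

# First and last timestamp of every outage.
out = (
    df.groupby("group")["date"]
      .agg(start_time="min", end_time="max")
      .reset_index(drop=True)
)
# total_seconds() keeps whole days, so outages longer than 24 hours stay correct.
out["down_minutes"] = (out["end_time"] - out["start_time"]).dt.total_seconds() / 60

print(out)
```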
Let's say the dates are listed in the dataframe df under column date. You can use shift() to create a second column with the subsequent date/time, then create a third that has your duration by subtracting them. Something like:
df['date2'] = df['date'].shift(-1)
df['difference'] = df['date2'] - df['date']
You'll obviously have one row at the end that doesn't have a following value, and therefore doesn't have a difference.

Python - Sum of column values between 2 dates

I am trying to create a new column in my dataframe:
Let X be a variable number of days.
                  Date  Units Sold  Total Units sold in the last X days
0  2019-01-01 19:00:00           5
1  2019-01-01 15:00:00           4
2  2019-01-05 11:00:00           1
3  2019-01-12 12:00:00           3
4  2019-01-15 15:00:00           2
5  2019-02-04 18:00:00           7
For each row, I need to sum up units sold + all the units sold in the last 10 days (letting x = 10 days)
Desired Result:
                  Date  Units Sold  Total Units sold in the last X days
0  2019-01-01 19:00:00           5                                    5
1  2019-01-01 15:00:00           4                                    9
2  2019-01-05 11:00:00           1                                   10
3  2019-01-12 12:00:00           3                                    4
4  2019-01-15 15:00:00           2                                    6
5  2019-02-04 18:00:00           7                                    7
I have used the .rolling(window=) method before with a fixed number of periods, and I think something like the following could help, but I can't get the syntax right:
df = df.rolling("10D").sum()
Please help!
Try:
df["Total Units sold in the last 10 days"] = df.rolling(on="Date", window="10D", closed="both").sum()["Units Sold"]
print(df)
Prints:
Date Units Sold Total Units sold in the last 10 days
0 2019-01-01 5 5.0
1 2019-01-01 4 9.0
2 2019-01-05 1 10.0
3 2019-01-12 3 4.0
4 2019-01-15 2 6.0
5 2019-02-04 7 7.0
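
For reference, a self-contained sketch of the same idea with the window length kept in a variable. It assumes the dates have no time component, as in the printed output above, and the name window_days is made up here. A time-based window requires the on column to be sorted, hence the explicit sort:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-05",
                            "2019-01-12", "2019-01-15", "2019-02-04"]),
    "Units Sold": [5, 4, 1, 3, 2, 7],
})

# Time-based rolling windows need a monotonic datetime column.
df = df.sort_values("Date", kind="stable").reset_index(drop=True)

window_days = 10  # X, the look-back length in days

# closed="both" also includes a row lying exactly window_days before the current one.
rolled = df.rolling(f"{window_days}D", on="Date", closed="both")["Units Sold"].sum()
df["Total Units sold in the last X days"] = rolled

print(df)
```

With closed="both" the row dated 2019-01-05 still counts towards the total for 2019-01-15; leaving closed at its default would exclude it.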

How to use pandas Grouper to get sum of values within each hour

I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
Hora_Retiro column is of timedelta64[ns] type
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index starts at 00:00:02, and I want it to start at 00:00:00 and then advance in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
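You can create an hour column from the Hora_Retiro column. The .dt.hour accessor below assumes a datetime column; for a timedelta column, use df['Hora_Retiro'].dt.components.hours instead.
df['hour'] = df['Hora_Retiro'].dt.hour
Then group by hour: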
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type. It cannot be datetime, because then the date part would be printed as well.
Indeed, your code creates groups starting at the minute / second taken from the first row.
To group by "full hours":
floor each element in this column to the hour (rounding would push times past the half hour into the next hour),
then group by this floored value.
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.count()
However, I advise you to decide what you want to count: rows, or the values in the count_uses column.
In the second case, replace the count function with sum.
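
A minimal sketch of grouping a timedelta column by full hours. This sketch uses dt.floor rather than rounding, so that every time from hh:00:00 to hh:59:59 lands in the bucket labelled hh:00:00. The rows are made-up sample values in the shape of the question's table, and the lowercase alias "h" assumes a reasonably recent pandas:

```python
import pandas as pd

# Made-up sample rows in the shape of the question's table.
hora_pico = pd.DataFrame({
    "Hora_Retiro": pd.to_timedelta(
        ["00:00:18", "00:45:00", "01:00:00", "01:59:59", "23:59:56"]
    ),
    "count_uses": [1, 1, 1, 1, 1],
})

# Truncate each timedelta to the start of its hour, then group by that key.
hour_start = hora_pico["Hora_Retiro"].dt.floor("h")
uses_per_hour = hora_pico.groupby(hour_start)["count_uses"].sum()

print(uses_per_hour)
```

Hours without any rows do not appear in the result; reindexing against a full range of 24 hourly timedeltas would be one way to add them back with a count of zero.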

equation that takes datetime into account

I am trying to set up a function with two different dictionaries.
datetime demand
0 2016-01-01 00:00:00 50.038
1 2016-01-01 00:00:10 50.021
2 2016-01-01 00:00:20 50.013
datetime dap
2016-01-01 00:00:00+01:00 23.86
2016-01-01 01:00:00+01:00 22.39
2016-01-01 02:00:00+01:00 20.59
As you can see, the dates match, but the time step (deltaT) is different.
The function I have set up is as follows:
for key, value in dap.items():
    a = demand * value
    print(a)
How do I make sure that in this function the dap value 23.86 is used for the datetime interval 2016-01-01 00:00:00 until 2016-01-01 01:00:00? This would mean that from the first dictionary indexed values 1-6 should be applied in the equation for 2016-01-01 00:00:00+01:00 23.86, and indexed values 7-12 are used for dap value 22.39 and so on?
datetime demand
0 2019-01-01 00:00:00 50.038
1 2019-01-01 00:00:10 50.021
2 2019-01-01 00:00:20 50.013
3 2019-01-01 00:00:30 50.004
4 2019-01-01 00:00:40 50.004
5 2019-01-01 00:00:50 50.009
6 2019-01-01 00:01:00 50.012
7 2019-01-01 00:01:10 49.998
8 2019-01-01 00:01:20 49.983
9 2019-01-01 00:01:30 49.979
10 2019-01-01 00:01:40 49.983
11 2019-01-01 00:01:50 49.983
12 2019-01-01 00:02:00 49.983
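
One possible approach, sketched under several assumptions: both tables are loaded as pandas DataFrames named demand and dap, each with a datetime column, and the +01:00 offset has been dropped so that both columns are timezone-naive. pd.merge_asof then attaches to every demand row the most recent dap value at or before its timestamp, which gives each hourly price the whole following hour. The last demand row is a hypothetical reading added only to show the switch to the next price:

```python
import pandas as pd

demand = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2016-01-01 00:00:00", "2016-01-01 00:00:10",
        "2016-01-01 00:00:20", "2016-01-01 01:00:00",  # last row is hypothetical
    ]),
    "demand": [50.038, 50.021, 50.013, 50.004],
})
dap = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 02:00:00",
    ]),
    "dap": [23.86, 22.39, 20.59],
})

# Both frames must be sorted by the key; the default direction="backward"
# picks the latest dap row whose datetime is <= the demand row's datetime.
merged = pd.merge_asof(demand, dap, on="datetime")
merged["a"] = merged["demand"] * merged["dap"]

print(merged)
```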

pandas time delta from grouped neighbors

I have a group of dates. I would like to subtract them from their forward neighbor to get the delta between them. My code look like this:
import pandas, numpy, StringIO
txt = '''ID,DATE
002691c9cec109e64558848f1358ac16,2003-08-13 00:00:00
002691c9cec109e64558848f1358ac16,2003-08-13 00:00:00
0088f218a1f00e0fe1b94919dc68ec33,2006-05-07 00:00:00
0088f218a1f00e0fe1b94919dc68ec33,2006-06-03 00:00:00
00d34668025906d55ae2e529615f530a,2006-03-09 00:00:00
00d34668025906d55ae2e529615f530a,2006-03-09 00:00:00
0101d3286dfbd58642a7527ecbddb92e,2007-10-13 00:00:00
0101d3286dfbd58642a7527ecbddb92e,2007-10-27 00:00:00
0103bd73af66e5a44f7867c0bb2203cc,2001-02-01 00:00:00
0103bd73af66e5a44f7867c0bb2203cc,2008-01-20 00:00:00
'''
df = pandas.read_csv(StringIO.StringIO(txt))
df = df.sort('DATE')
df.DATE = pandas.to_datetime(df.DATE)
grouped = df.groupby('ID')
df['X_SEQUENCE_GAP'] = pandas.concat([g['DATE'].sub(g['DATE'].shift(), fill_value=0) for title,g in grouped])
I am getting pretty incomprehensible results, so I assume I have a logic error.
The results I get are as follows:
ID DATE X_SEQUENCE_GAP
0 002691c9cec109e64558848f1358ac16 2003-08-13 00:00:00 12277 days, 00:00:00
1 002691c9cec109e64558848f1358ac16 2003-08-13 00:00:00 00:00:00
3 0088f218a1f00e0fe1b94919dc68ec33 2006-06-03 00:00:00 27 days, 00:00:00
2 0088f218a1f00e0fe1b94919dc68ec33 2006-05-07 00:00:00 13275 days, 00:00:00
5 00d34668025906d55ae2e529615f530a 2006-03-09 00:00:00 13216 days, 00:00:00
4 00d34668025906d55ae2e529615f530a 2006-03-09 00:00:00 00:00:00
6 0101d3286dfbd58642a7527ecbddb92e 2007-10-13 00:00:00 13799 days, 00:00:00
7 0101d3286dfbd58642a7527ecbddb92e 2007-10-27 00:00:00 14 days, 00:00:00
9 0103bd73af66e5a44f7867c0bb2203cc 2008-01-20 00:00:00 2544 days, 00:00:00
8 0103bd73af66e5a44f7867c0bb2203cc 2001-02-01 00:00:00 11354 days, 00:00:00
I was expecting, for example, that rows 0 and 1 would both have a result of 0. Any help is most appreciated.
This is in 0.11rc1 (I don't think it will work on a prior version).
When you shift dates, the first one becomes NaT (like NaN, but for datetimes/timedeltas):
In [27]: df['X_SEQUENCE_GAP'] = grouped.apply(lambda g: g['DATE']-g['DATE'].shift())
In [30]: df.sort()
Out[30]:
ID DATE X_SEQUENCE_GAP
0 002691c9cec109e64558848f1358ac16 2003-08-13 00:00:00 NaT
1 002691c9cec109e64558848f1358ac16 2003-08-13 00:00:00 00:00:00
2 0088f218a1f00e0fe1b94919dc68ec33 2006-05-07 00:00:00 NaT
3 0088f218a1f00e0fe1b94919dc68ec33 2006-06-03 00:00:00 27 days, 00:00:00
4 00d34668025906d55ae2e529615f530a 2006-03-09 00:00:00 NaT
5 00d34668025906d55ae2e529615f530a 2006-03-09 00:00:00 00:00:00
6 0101d3286dfbd58642a7527ecbddb92e 2007-10-13 00:00:00 NaT
7 0101d3286dfbd58642a7527ecbddb92e 2007-10-27 00:00:00 14 days, 00:00:00
8 0103bd73af66e5a44f7867c0bb2203cc 2001-02-01 00:00:00 NaT
9 0103bd73af66e5a44f7867c0bb2203cc 2008-01-20 00:00:00 2544 days, 00:00:00
You can then fillna (but you have to do this awkward type conversion because of a numpy bug, which will get fixed in 0.12).
In [57]: df['X_SEQUENCE_GAP'].sort_index().astype('timedelta64[ns]').fillna(0)
Out[57]:
0 00:00:00
1 00:00:00
2 00:00:00
3 27 days, 00:00:00
4 00:00:00
5 00:00:00
6 00:00:00
7 14 days, 00:00:00
8 00:00:00
9 2544 days, 00:00:00
Name: X_SEQUENCE_GAP, dtype: timedelta64[ns]
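
On current pandas versions the same result can be sketched more directly. This is not the code from the answer above but a swapped-in variant that assumes Python 3 and a recent pandas: groupby(...).diff() computes the gap to the previous row within each ID, and the leading NaT of every group is replaced by a zero Timedelta. Only three of the question's IDs are used to keep the sample short:

```python
import io

import pandas as pd

txt = """ID,DATE
002691c9cec109e64558848f1358ac16,2003-08-13 00:00:00
002691c9cec109e64558848f1358ac16,2003-08-13 00:00:00
0088f218a1f00e0fe1b94919dc68ec33,2006-06-03 00:00:00
0088f218a1f00e0fe1b94919dc68ec33,2006-05-07 00:00:00
0103bd73af66e5a44f7867c0bb2203cc,2001-02-01 00:00:00
0103bd73af66e5a44f7867c0bb2203cc,2008-01-20 00:00:00
"""

df = pd.read_csv(io.StringIO(txt), parse_dates=["DATE"])

# Sort inside each ID so that diff() always subtracts the earlier date.
df = df.sort_values(["ID", "DATE"]).reset_index(drop=True)

# The first row of every ID has no predecessor, so diff() yields NaT there.
gap = df.groupby("ID")["DATE"].diff()
df["X_SEQUENCE_GAP"] = gap.fillna(pd.Timedelta(0))

print(df)
```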
