Groupby a column but have another column as the key - python

I have a df that looks like this (the actual df is much larger):
DateTime Value Date Time period DatePeriod
0 2022-09-18 06:00:00 5.4 18/09/2022 06:00 morning 18/09/2022-morning
1 2022-09-18 07:00:00 6.0 18/09/2022 07:00 morning 18/09/2022-morning
2 2022-09-18 08:00:00 6.5 18/09/2022 08:00 morning 18/09/2022-morning
3 2022-09-18 09:00:00 6.7 18/09/2022 09:00 morning 18/09/2022-morning
4 2022-09-18 10:00:00 6.9 18/09/2022 10:00 morning 18/09/2022-morning
11 2022-09-18 17:00:00 6.8 18/09/2022 17:00 morning 18/09/2022-morning
12 2022-09-18 18:00:00 6.4 18/09/2022 18:00 night 18/09/2022-night
13 2022-09-18 19:00:00 5.7 18/09/2022 19:00 night 18/09/2022-night
14 2022-09-18 20:00:00 4.8 18/09/2022 20:00 night 18/09/2022-night
15 2022-09-18 21:00:00 5.4 18/09/2022 21:00 night 18/09/2022-night
16 2022-09-18 22:00:00 4.7 18/09/2022 22:00 night 19/09/2022-night
21 2022-09-19 03:00:00 3.8 19/09/2022 03:00 night 19/09/2022-night
22 2022-09-19 04:00:00 3.5 19/09/2022 04:00 night 19/09/2022-night
23 2022-09-19 05:00:00 2.8 19/09/2022 05:00 night 19/09/2022-night
24 2022-09-19 06:00:00 3.8 19/09/2022 06:00 morning 19/09/2022-morning
I created a dictionary by grouping on DatePeriod and collecting the values in a list, like this:
result = df.groupby('DatePeriod')['Value'].apply(list).to_dict()
Output:
{'18/09/2022-morning': [5.4, 6.0, 6.5, 6.9, 7.9, 8.5, 7.5, 7.9, 7.8, 7.6, 6.8],
'18/09/2022-night': [6.4, 5.7, 4.8, 5.4, 4.7, 4.3],
'19/09/2022-morning': [3.8],
'19/09/2022-night': [4.1, 4.4, 4.3, 3.8, 3.5, 2.8]}
Is there any way I can get the exact same result, but with a DateTime as the key instead of the DatePeriod? I.e. I still want the grouping to be based on DatePeriod and the values to be a list of values;
the only difference is that I want the full DateTime as the key (it can be the first DateTime of each group), just not the DatePeriod. Example:
{'2022-09-18 06:00:00': [5.4, 6.0, 6.5, 6.9, 7.9, 8.5, 7.5, 7.9, 7.8, 7.6, 6.8],
'2022-09-18 18:00:00' : [6.4, 5.7, 4.8, 5.4, 4.7, 4.3],
'2022-09-19 06:00:00': [3.8],
'2022-09-19 03:00:00': [4.1, 4.4, 4.3, 3.8, 3.5, 2.8]}
Is there any easy way to do this?
Thanks in advance

IIUC you can use aggregation:
result = (df.groupby('DatePeriod')
            .agg({"Value": list, "DateTime": "first"})
            .set_index("DateTime")["Value"]
            .to_dict())
print(result)
{'2022-05-12 06:00:00': [11.8],
 '2022-05-12 18:00:00': [12.5],
 '2022-05-13 06:00:00': [10.9],
 '2022-05-13 18:00:00': [13.5],
 '2022-05-14 06:00:00': [11.8]}
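An alternative sketch that builds the dict directly from the groups, keyed by each group's first DateTime (the small frame below is a made-up subset for illustration):

```python
import pandas as pd

# Made-up subset resembling the question's frame
df = pd.DataFrame({
    'DateTime': pd.to_datetime(['2022-09-18 06:00', '2022-09-18 07:00',
                                '2022-09-18 18:00', '2022-09-18 19:00']),
    'Value': [5.4, 6.0, 6.4, 5.7],
    'DatePeriod': ['18/09/2022-morning', '18/09/2022-morning',
                   '18/09/2022-night', '18/09/2022-night'],
})

# Key each group by its first DateTime; the value is the list of Values
result = {g['DateTime'].iloc[0]: g['Value'].tolist()
          for _, g in df.groupby('DatePeriod', sort=False)}
```

The keys here are Timestamps; if string keys are needed, format the key with .strftime('%Y-%m-%d %H:%M:%S').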

Related

Splitting Dataframe time into morning and evening

I have a df that looks like this (shortened):
DateTime Value Date Time
0 2022-09-18 06:00:00 5.4 18/09/2022 06:00
1 2022-09-18 07:00:00 6.0 18/09/2022 07:00
2 2022-09-18 08:00:00 6.5 18/09/2022 08:00
3 2022-09-18 09:00:00 6.7 18/09/2022 09:00
8 2022-09-18 14:00:00 7.9 18/09/2022 14:00
9 2022-09-18 15:00:00 7.8 18/09/2022 15:00
10 2022-09-18 16:00:00 7.6 18/09/2022 16:00
11 2022-09-18 17:00:00 6.8 18/09/2022 17:00
12 2022-09-18 18:00:00 6.4 18/09/2022 18:00
13 2022-09-18 19:00:00 5.7 18/09/2022 19:00
14 2022-09-18 20:00:00 4.8 18/09/2022 20:00
15 2022-09-18 21:00:00 5.4 18/09/2022 21:00
16 2022-09-18 22:00:00 4.7 18/09/2022 22:00
17 2022-09-18 23:00:00 4.3 18/09/2022 23:00
18 2022-09-19 00:00:00 4.1 19/09/2022 00:00
19 2022-09-19 01:00:00 4.4 19/09/2022 01:00
22 2022-09-19 04:00:00 3.5 19/09/2022 04:00
23 2022-09-19 05:00:00 2.8 19/09/2022 05:00
24 2022-09-19 06:00:00 3.8 19/09/2022 06:00
I want to create a new column where I split the hours between day and night like this:
00:00 - 05:00 night ,
06:00 - 18:00 day ,
19:00 - 23:00 night
But apparently one can't use the same label twice. How can I solve this problem? Here is my code:
df['period'] = pd.cut(pd.to_datetime(df.DateTime).dt.hour,
                      bins=[0, 5, 17, 23],
                      labels=['night', 'morning', 'night'],
                      include_lowest=True)
It's returning
ValueError: labels must be unique if ordered=True; pass ordered=False for duplicate labels
If I understood correctly: if the time is between 00:00-05:00 or 19:00-23:00, you want your new column to say 'night', else 'day'. Well, here's that code:
df['day/night'] = df['Time'].apply(lambda x: 'night' if '00:00' <= x <= '05:00' or '19:00' <= x <= '23:00' else 'day')
Or you can add the ordered=False parameter using your pd.cut method.
Input:
df = pd.DataFrame(columns=['DateTime', 'Value', 'Date', 'Time'], data=[
['2022-09-18 06:00:00', 5.4, '18/09/2022', '06:00'],
['2022-09-18 07:00:00', 6.0, '18/09/2022', '07:00'],
['2022-09-18 08:00:00', 6.5, '18/09/2022', '08:00'],
['2022-09-18 09:00:00', 6.7, '18/09/2022', '09:00'],
['2022-09-18 14:00:00', 7.9, '18/09/2022', '14:00'],
['2022-09-18 15:00:00', 7.8, '18/09/2022', '15:00'],
['2022-09-18 16:00:00', 7.6, '18/09/2022', '16:00'],
['2022-09-18 17:00:00', 6.8, '18/09/2022', '17:00'],
['2022-09-18 18:00:00', 6.4, '18/09/2022', '18:00'],
['2022-09-18 19:00:00', 5.7, '18/09/2022', '19:00'],
['2022-09-18 20:00:00', 4.8, '18/09/2022', '20:00'],
['2022-09-18 21:00:00', 5.4, '18/09/2022', '21:00'],
['2022-09-18 22:00:00', 4.7, '18/09/2022', '22:00'],
['2022-09-18 23:00:00', 4.3, '18/09/2022', '23:00'],
['2022-09-19 00:00:00', 4.1, '19/09/2022', '00:00'],
['2022-09-19 01:00:00', 4.4, '19/09/2022', '01:00'],
['2022-09-19 04:00:00', 3.5, '19/09/2022', '04:00'],
['2022-09-19 05:00:00', 2.8, '19/09/2022', '05:00'],
['2022-09-19 06:00:00', 3.8, '19/09/2022', '06:00']])
Output:
DateTime Value Date Time day/night
0 2022-09-18 06:00:00 5.4 18/09/2022 06:00 day
1 2022-09-18 07:00:00 6.0 18/09/2022 07:00 day
2 2022-09-18 08:00:00 6.5 18/09/2022 08:00 day
3 2022-09-18 09:00:00 6.7 18/09/2022 09:00 day
4 2022-09-18 14:00:00 7.9 18/09/2022 14:00 day
5 2022-09-18 15:00:00 7.8 18/09/2022 15:00 day
6 2022-09-18 16:00:00 7.6 18/09/2022 16:00 day
7 2022-09-18 17:00:00 6.8 18/09/2022 17:00 day
8 2022-09-18 18:00:00 6.4 18/09/2022 18:00 day
9 2022-09-18 19:00:00 5.7 18/09/2022 19:00 night
10 2022-09-18 20:00:00 4.8 18/09/2022 20:00 night
11 2022-09-18 21:00:00 5.4 18/09/2022 21:00 night
12 2022-09-18 22:00:00 4.7 18/09/2022 22:00 night
13 2022-09-18 23:00:00 4.3 18/09/2022 23:00 night
14 2022-09-19 00:00:00 4.1 19/09/2022 00:00 night
15 2022-09-19 01:00:00 4.4 19/09/2022 01:00 night
16 2022-09-19 04:00:00 3.5 19/09/2022 04:00 night
17 2022-09-19 05:00:00 2.8 19/09/2022 05:00 night
18 2022-09-19 06:00:00 3.8 19/09/2022 06:00 day
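The same labels can also be derived from the hour with numpy.where instead of comparing Time strings; a sketch, assuming hours 6-18 inclusive count as day:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'DateTime': ['2022-09-18 05:00:00', '2022-09-18 06:00:00',
                                '2022-09-18 18:00:00', '2022-09-18 19:00:00']})
hour = pd.to_datetime(df['DateTime']).dt.hour
# 'day' for hours 6..18 inclusive, 'night' otherwise
df['day/night'] = np.where(hour.between(6, 18), 'day', 'night')
```

This avoids relying on lexicographic comparison of 'HH:MM' strings, though that comparison does work as long as the times are zero-padded.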
You have two options.
Either you don't care about the order, in which case you can set ordered=False as a parameter of cut:
df['period'] = pd.cut(pd.to_datetime(df.DateTime).dt.hour,
                      bins=[0, 5, 17, 23],
                      labels=['night', 'morning', 'night'],
                      ordered=False,
                      include_lowest=True)
Or you care to have night and morning ordered, in which case you can further convert to ordered Categorical:
df['period'] = pd.Categorical(df['period'], categories=['night', 'morning'], ordered=True)
output:
DateTime Value Date Time period
0 2022-09-18 06:00:00 5.4 18/09/2022 06:00 morning
1 2022-09-18 07:00:00 6.0 18/09/2022 07:00 morning
2 2022-09-18 08:00:00 6.5 18/09/2022 08:00 morning
3 2022-09-18 09:00:00 6.7 18/09/2022 09:00 morning
8 2022-09-18 14:00:00 7.9 18/09/2022 14:00 morning
9 2022-09-18 15:00:00 7.8 18/09/2022 15:00 morning
10 2022-09-18 16:00:00 7.6 18/09/2022 16:00 morning
11 2022-09-18 17:00:00 6.8 18/09/2022 17:00 morning
12 2022-09-18 18:00:00 6.4 18/09/2022 18:00 night
13 2022-09-18 19:00:00 5.7 18/09/2022 19:00 night
14 2022-09-18 20:00:00 4.8 18/09/2022 20:00 night
15 2022-09-18 21:00:00 5.4 18/09/2022 21:00 night
16 2022-09-18 22:00:00 4.7 18/09/2022 22:00 night
17 2022-09-18 23:00:00 4.3 18/09/2022 23:00 night
18 2022-09-19 00:00:00 4.1 19/09/2022 00:00 night
19 2022-09-19 01:00:00 4.4 19/09/2022 01:00 night
22 2022-09-19 04:00:00 3.5 19/09/2022 04:00 night
23 2022-09-19 05:00:00 2.8 19/09/2022 05:00 night
24 2022-09-19 06:00:00 3.8 19/09/2022 06:00 morning
column:
df['period']
0 morning
1 morning
2 morning
...
23 night
24 morning
Name: period, dtype: category
Categories (2, object): ['morning', 'night']
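To see what the ordered conversion buys you, here is a toy sketch (not the question's frame): an ordered Categorical supports comparisons and min/max, which an unordered one does not:

```python
import pandas as pd

s = pd.Series(['morning', 'night', 'morning'])
# Convert to an ordered Categorical with night < morning
s = pd.Series(pd.Categorical(s, categories=['night', 'morning'], ordered=True))
earliest = s.min()                   # min respects the category order
after_night = (s > 'night').tolist() # elementwise comparison against a category
```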

Sum hourly values between 2 dates in pandas

I have a df like this:
DATE PP
0 2011-12-20 07:00:00 0.0
1 2011-12-20 08:00:00 0.0
2 2011-12-20 09:00:00 2.0
3 2011-12-20 10:00:00 0.0
4 2011-12-20 11:00:00 0.0
5 2011-12-20 12:00:00 0.0
6 2011-12-20 13:00:00 0.0
7 2011-12-20 14:00:00 5.0
8 2011-12-20 15:00:00 0.0
9 2011-12-20 16:00:00 0.0
10 2011-12-20 17:00:00 2.0
11 2011-12-20 18:00:00 0.0
12 2011-12-20 19:00:00 0.0
13 2011-12-20 20:00:00 1.0
14 2011-12-20 21:00:00 0.0
15 2011-12-20 22:00:00 0.0
16 2011-12-20 23:00:00 0.0
17 2011-12-21 00:00:00 0.0
18 2011-12-21 01:00:00 3.0
19 2011-12-21 02:00:00 0.0
20 2011-12-21 03:00:00 0.0
21 2011-12-21 04:00:00 0.0
22 2011-12-21 05:00:00 0.0
23 2011-12-21 06:00:00 5.0
24 2011-12-21 07:00:00 0.0
... .... ... ...
75609 2020-08-05 16:00:00 0.0
75610 2020-08-05 19:00:00 0.0
[75614 rows x 2 columns]
I want the cumulative values of the PP column between two specific hours on different days: the sum from every 07:00:00 of one day to the 07:00:00 of the next day. For example, I want the cumulative PP from 2011-12-20 07:00:00 to 2011-12-21 07:00:00:
Expected result:
DATE CUMULATIVE VALUES PP
0 2011-12-20 18
1 2011-12-21 5
2 2011-12-22 10
etc... etc... ...
I tried this:
df['DAY'] = df['DATE'].dt.strftime('%d')
cumulatives=pd.DataFrame(df.groupby(['DAY'])['PP'].sum())
But this only sums whole calendar days, not from 07:00:00 of one day to 07:00:00 of the next (and grouping on dt.strftime('%d') would also merge the same day-of-month across different months).
Data:
{'DATE': ['2011-12-20 07:00:00', '2011-12-20 08:00:00', '2011-12-20 09:00:00',
'2011-12-20 10:00:00', '2011-12-20 11:00:00', '2011-12-20 12:00:00',
'2011-12-20 13:00:00', '2011-12-20 14:00:00', '2011-12-20 15:00:00',
'2011-12-20 16:00:00', '2011-12-20 17:00:00', '2011-12-20 18:00:00',
'2011-12-20 19:00:00', '2011-12-20 20:00:00', '2011-12-20 21:00:00',
'2011-12-20 22:00:00', '2011-12-20 23:00:00', '2011-12-21 00:00:00',
'2011-12-21 01:00:00', '2011-12-21 02:00:00', '2011-12-21 03:00:00',
'2011-12-21 04:00:00', '2011-12-21 05:00:00', '2011-12-21 06:00:00',
'2011-12-21 07:00:00', '2020-08-05 16:00:00', '2020-08-05 19:00:00'],
'PP': [0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 2.0, 0.0, 0.0, 1.0,
0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0]}
One way is to subtract 7 hours from each date, so that timestamps before 07:00 are credited to the previous day; then groupby.sum on the shifted date fetches the desired output:
df['DATE'] = pd.to_datetime(df['DATE'])
out = (df.groupby(df['DATE'].sub(pd.to_timedelta('7h')).dt.date)['PP']
         .sum().reset_index(name='SUM'))
Output:
DATE SUM
0 2011-12-20 18.0
1 2011-12-21 0.0
2 2020-08-05 0.0
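A self-contained sketch of the same 7-hour shift trick on a tiny made-up frame:

```python
import pandas as pd

# Tiny made-up frame: the 23:00 and next day's 06:00 readings
# belong to the 2011-12-20 07:00-to-07:00 window
df = pd.DataFrame({
    'DATE': pd.to_datetime(['2011-12-20 07:00', '2011-12-20 23:00',
                            '2011-12-21 06:00', '2011-12-21 07:00']),
    'PP': [1.0, 2.0, 4.0, 8.0],
})
# Shift back 7h: anything before 07:00 is credited to the previous day
key = df['DATE'].sub(pd.Timedelta(hours=7)).dt.date
out = df.groupby(key)['PP'].sum().reset_index(name='SUM')
```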

Delete rows above certain value once number is reached

I have a large dataset where I am interested in the part where it shuts down and when it is shut down. However, the data also includes data of the startup which I want to filter out.
The data goes down to <0.2, stays there for a while and then goes up again >0.2. I want to delete the part where it has been <0.2 before and is going up to >0.2.
I have used a standard filter, but since I am still interested in the first part this does not seem to work. Just looking at the derivative is also not an option since the value can go up and down in the beginning as well, the only difference with the latter part is that it has been <0.2 before.
How can I do this?
import pandas as pd
data = {
"Date and Time": ["2020-06-07 00:00", "2020-06-07 00:01", "2020-06-07 00:02", "2020-06-07 00:03", "2020-06-07 00:04", "2020-06-07 00:05", "2020-06-07 00:06", "2020-06-07 00:07", "2020-06-07 00:08", "2020-06-07 00:09", "2020-06-07 00:10", "2020-06-07 00:11", "2020-06-07 00:12", "2020-06-07 00:13", "2020-06-07 00:14", "2020-06-07 00:15", "2020-06-07 00:16", "2020-06-07 00:17", "2020-06-07 00:18", "2020-06-07 00:19", "2020-06-07 00:20", "2020-06-07 00:21", "2020-06-07 00:22", "2020-06-07 00:23", "2020-06-07 00:24", "2020-06-07 00:25", "2020-06-07 00:26", "2020-06-07 00:27", "2020-06-07 00:28", "2020-06-07 00:29"],
"Value": [16.2, 15.1, 13.8, 12.0, 11.9, 12.1, 10.8, 9.8, 8.3, 6.2, 4.3, 4.2, 4.2, 3.3, 1.8, 0.1, 0.05, 0.15, 0.1, 0.18, 0.25, 1, 4, 8, 12.0, 12.0, 12.0, 12.0, 12.0, 12.0],
}
df = pd.DataFrame(data)
Required output:
data = {
"Date and Time": ["2020-06-07 00:00", "2020-06-07 00:01", "2020-06-07 00:02", "2020-06-07 00:03", "2020-06-07 00:04", "2020-06-07 00:05", "2020-06-07 00:06", "2020-06-07 00:07", "2020-06-07 00:08", "2020-06-07 00:09", "2020-06-07 00:10", "2020-06-07 00:11", "2020-06-07 00:12", "2020-06-07 00:13", "2020-06-07 00:14", "2020-06-07 00:15", "2020-06-07 00:16", "2020-06-07 00:17", "2020-06-07 00:18", "2020-06-07 00:19"],
"Value": [16.2, 15.1, 13.8, 12.0, 11.9, 12.1, 10.8, 9.8, 8.3, 6.2, 4.3, 4.2, 4.2, 3.3, 1.8, 0.1, 0.05, 0.15, 0.1, 0.18],
}
You can identify the switching points (above 0.2 to under and vice versa) using (df['Value'] < 0.2).diff() and then use cumsum. To remove the parts of the dataframe after the value has been below 0.2, simply drop any rows with a cumsum of 2 or more; the fillna(False) keeps the first row, whose diff is NaN.
s = (df['Value'] < 0.2).diff().fillna(False).cumsum()
df.loc[s < 2]
Result:
Date and Time Value
0 2020-06-07 00:00 16.20
1 2020-06-07 00:01 15.10
2 2020-06-07 00:02 13.80
3 2020-06-07 00:03 12.00
4 2020-06-07 00:04 11.90
5 2020-06-07 00:05 12.10
6 2020-06-07 00:06 10.80
7 2020-06-07 00:07 9.80
8 2020-06-07 00:08 8.30
9 2020-06-07 00:09 6.20
10 2020-06-07 00:10 4.30
11 2020-06-07 00:11 4.20
12 2020-06-07 00:12 4.20
13 2020-06-07 00:13 3.30
14 2020-06-07 00:14 1.80
15 2020-06-07 00:15 0.10
16 2020-06-07 00:16 0.05
17 2020-06-07 00:17 0.15
18 2020-06-07 00:18 0.10
19 2020-06-07 00:19 0.18
You can build boolean masks for the required condition
it has been <0.2 before and is going up to >0.2.
and then filter:
# mask #1 to have the sequence ever been < 0.2
m1 = df['Value'].lt(0.2).cummax()
# mask #2 to have the values currently > 0.2
m2 = df['Value'].gt(0.2)
# Final mask to have the negation of BOTH (m1 and m2)
mask = ~(m1 & m2)
df.loc[mask]
The first mask makes use of cummax() to flag every row at or after the first value < 0.2.
The second mask flags the rows whose value is currently > 0.2.
Final mask is to execute the action:
to delete the part where... [the conditions met]
Result:
Date and Time Value
0 2020-06-07 00:00 16.20
1 2020-06-07 00:01 15.10
2 2020-06-07 00:02 13.80
3 2020-06-07 00:03 12.00
4 2020-06-07 00:04 11.90
5 2020-06-07 00:05 12.10
6 2020-06-07 00:06 10.80
7 2020-06-07 00:07 9.80
8 2020-06-07 00:08 8.30
9 2020-06-07 00:09 6.20
10 2020-06-07 00:10 4.30
11 2020-06-07 00:11 4.20
12 2020-06-07 00:12 4.20
13 2020-06-07 00:13 3.30
14 2020-06-07 00:14 1.80
15 2020-06-07 00:15 0.10
16 2020-06-07 00:16 0.05
17 2020-06-07 00:17 0.15
18 2020-06-07 00:18 0.10
19 2020-06-07 00:19 0.18
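The mask logic condensed into a runnable sketch on a toy series (values chosen to dip below 0.2 and then come back up):

```python
import pandas as pd

df = pd.DataFrame({'Value': [1.0, 0.5, 0.1, 0.15, 0.3, 2.0]})
m1 = df['Value'].lt(0.2).cummax()   # True from the first value < 0.2 onwards
m2 = df['Value'].gt(0.2)            # True where the value is currently > 0.2
out = df.loc[~(m1 & m2)]            # drop the post-dip recovery rows
```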

How to convert a pandas time series with hour (h) as index unit into pandas datetime format?

I am working on time-series data, where my pandas dataframe has indices specified in hours, like this:
[0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, ...]
This goes on for a few thousand hours. I know that the first measurement was taken on, let's say, May 1, 2017 12:00. How do I use this information to turn my indices into pandas datetime format?
You can convert the hour offsets to a DatetimeIndex by passing the origin parameter, together with unit='h', to to_datetime:
idx = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4]
df = pd.DataFrame({'a':range(13)}, index=idx)
start = 'May 1, 2017 12:00'
df.index = pd.to_datetime(df.index, origin=start, unit='h')
print(df)
a
2017-05-01 12:00:00 0
2017-05-01 12:12:00 1
2017-05-01 12:24:00 2
2017-05-01 12:36:00 3
2017-05-01 12:48:00 4
2017-05-01 13:00:00 5
2017-05-01 13:12:00 6
2017-05-01 13:24:00 7
2017-05-01 13:36:00 8
2017-05-01 13:48:00 9
2017-05-01 14:00:00 10
2017-05-01 14:12:00 11
2017-05-01 14:24:00 12
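A minimal self-contained check of the origin/unit approach (shorter index, with 0.5-hour steps to keep the float arithmetic exact):

```python
import pandas as pd

idx = [0.0, 0.5, 1.0]  # hours since the first measurement
converted = pd.to_datetime(idx, origin='May 1, 2017 12:00', unit='h')
```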
You can use pandas.date_range to specify the number of periods based on the length of your index (in this case a list) and the frequency, which is in this case 12 minutes (1/5 of an hour):
import numpy as np
import pandas as pd

l = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4]
data = {'Num': np.random.randint(1, 10, size=len(l))}
idx = pd.date_range(start=pd.Timestamp(2017, 5, 1, 12), periods=len(l), freq='12min')
df = pd.DataFrame(data=data, index=idx)
print(df)
Num
2017-05-01 12:00:00 8
2017-05-01 12:12:00 3
2017-05-01 12:24:00 3
2017-05-01 12:36:00 4
2017-05-01 12:48:00 8
2017-05-01 13:00:00 3
2017-05-01 13:12:00 6
2017-05-01 13:24:00 3
2017-05-01 13:36:00 4
2017-05-01 13:48:00 9
2017-05-01 14:00:00 5
2017-05-01 14:12:00 2
2017-05-01 14:24:00 6
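If the hour step is not known in advance, the frequency can also be derived from the index spacing instead of hardcoded; a sketch that assumes uniform spacing:

```python
import pandas as pd

l = [0.0, 0.5, 1.0, 1.5]                 # hours since the first measurement
step = pd.Timedelta(hours=l[1] - l[0])   # derive the spacing (0.5 h = 30 min)
idx = pd.date_range(start=pd.Timestamp(2017, 5, 1, 12), periods=len(l), freq=step)
```

date_range accepts a Timedelta directly as freq, so no frequency string needs to be constructed.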

Pandas: filling missing values in time series forward using a formula

I have a time series of data in a DataFrame that has missing values at both the beginning and the end of the sample.
I'm trying to fill the missing values at the end by growing it forward using a simple AR(1) process.
For example,
X(t+1) - X(t) = 0.5*[X(t) - X(t-1)]
A = [np.nan, np.nan, 5.5, 5.7, 5.9, 6.1, 6.0, 5.9, np.nan, np.nan, np.nan]
df = pd.DataFrame({'A': A}, index=pd.date_range(start='2010',
                                                periods=len(A),
                                                freq='QS'))
A
2010-01-01 5.5
2010-04-01 5.7
2010-07-01 5.9
2010-10-01 6.1
2011-01-01 6.0
2011-04-01 5.9
2011-07-01 NaN
2011-10-01 NaN
2012-01-01 NaN
What I want:
A
2010-01-01 NaN
2010-04-01 NaN
2010-07-01 5.5000
2010-10-01 5.7000
2011-01-01 5.9000
2011-04-01 6.1000
2011-07-01 6.0000
2011-10-01 5.9000
2012-01-01 5.8500
2012-04-01 5.8250
2012-07-01 5.8125
Grabbing the next entry in the series is relatively easy:
NEXT = 0.5*df.dropna().diff().iloc[-1] + df.dropna().iloc[-1]
But appending that to the DataFrame in a nice ways is giving me some trouble.
You can use the code below to do the operation (note that 1.5*X(t-1) - 0.5*X(t-2) is the same as X(t-1) + 0.5*[X(t-1) - X(t-2)]):
import numpy as np
import pandas as pd

A = [np.nan, np.nan, 5.5, 5.7, 5.9, 6.1, 6.0, 5.9, np.nan, np.nan, np.nan]
df = pd.DataFrame({'A': A}, index=pd.date_range(start='2010', periods=len(A), freq='QS'))

for idx in df[df.A.isnull()].index:
    df.loc[idx, 'A'] = 1.5 * df.A.shift().loc[idx] - 0.5 * df.A.shift(2).loc[idx]
# Output dataframe
A
2010-01-01 NaN
2010-04-01 NaN
2010-07-01 5.5000
2010-10-01 5.7000
2011-01-01 5.9000
2011-04-01 6.1000
2011-07-01 6.0000
2011-10-01 5.9000
2012-01-01 5.8500
2012-04-01 5.8250
2012-07-01 5.8125
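Because each increment is half the previous one, the trailing NaNs can also be filled in closed form without a loop: k steps past the last observation, the value is X_last + d*(1 - 0.5**k), where d is the last observed increment. A vectorized sketch of the same fill:

```python
import numpy as np
import pandas as pd

A = [np.nan, np.nan, 5.5, 5.7, 5.9, 6.1, 6.0, 5.9, np.nan, np.nan, np.nan]
df = pd.DataFrame({'A': A},
                  index=pd.date_range(start='2010', periods=len(A), freq='QS'))

last = df['A'].last_valid_index()                   # last observed timestamp
d = df['A'].loc[last] - df['A'].shift().loc[last]   # last observed increment
trailing = df.index[df.index > last]                # trailing NaN positions
k = np.arange(1, len(trailing) + 1)
# increments halve each step, so the cumulative add-on is d * (1 - 0.5**k)
df.loc[trailing, 'A'] = df['A'].loc[last] + d * (1 - 0.5 ** k)
```

The leading NaNs are left untouched, matching the expected output above.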
