pandas merge/rearrange/sum single dataframe - python

I have the following dataframe:
latitude longitude d1 d2 ar merge_time
0 15 10.0 12/1/1981 0:00 12/4/1981 3:00 2.317681391 1981-12-04 04:00:00
1 15 10.1 12/1/1981 0:00 12/1/1981 3:00 2.293604127 1981-12-01 04:00:00
2 15 10.2 12/1/1981 0:00 12/1/1981 2:00 2.264552161 1981-12-01 03:00:00
3 15 10.3 12/1/1981 0:00 12/4/1981 2:00 2.278556423 1981-12-04 03:00:00
4 15 10.1 12/1/1981 4:00 12/1/1981 22:00 2.168275766 1981-12-01 23:00:00
5 15 10.2 12/1/1981 3:00 12/1/1981 21:00 2.114636628 1981-12-01 22:00:00
6 15 10.4 12/1/1981 0:00 12/2/1981 17:00 1.384415903 1981-12-02 18:00:00
7 15 10.1 12/2/1981 8:00 12/2/1981 11:00 2.293604127 1981-12-01 12:00:00
I want to group and rearrange the above dataframe (summing the values of column ar) based on the following criteria:
1. Values of latitude and longitude are equal, and
2. Values of d2 and merge_time are equal within the groups from 1.
Here is the desired output:
latitude longitude d1 d2 ar
15 10 12/1/1981 0:00 12/4/1981 3:00 2.317681391
15 10.1 12/1/1981 0:00 12/1/1981 22:00 4.461879893
15 10.2 12/1/1981 0:00 12/1/1981 21:00 4.379188789
15 10.3 12/1/1981 0:00 12/4/1981 2:00 2.278556423
15 10.4 12/1/1981 0:00 12/2/1981 17:00 1.384415903
15 10.1 12/2/1981 8:00 12/2/1981 11:00 2.293604127
How can I achieve this?
Any help is appreciated.

After expressing your requirements in the comments:
1. group by location (longitude & latitude)
2. find rows within this grouping that are contiguous in time
3. group and aggregate these contiguous sections
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""latitude  longitude  d1  d2  ar  merge_time
0  15  10.0  12/1/1981 0:00  12/4/1981 3:00  2.317681391  1981-12-04 04:00:00
1  15  10.1  12/1/1981 0:00  12/1/1981 3:00  2.293604127  1981-12-01 04:00:00
2  15  10.2  12/1/1981 0:00  12/1/1981 2:00  2.264552161  1981-12-01 03:00:00
3  15  10.3  12/1/1981 0:00  12/4/1981 2:00  2.278556423  1981-12-04 03:00:00
4  15  10.1  12/1/1981 4:00  12/1/1981 22:00  2.168275766  1981-12-01 23:00:00
5  15  10.2  12/1/1981 3:00  12/1/1981 21:00  2.114636628  1981-12-01 22:00:00
6  15  10.4  12/1/1981 0:00  12/2/1981 17:00  1.384415903  1981-12-02 18:00:00
7  15  10.1  12/2/1981 8:00  12/2/1981 11:00  2.293604127  1981-12-01 12:00:00"""),
                 sep=r"\s\s+", engine="python")

# parse the three date columns
df = df.assign(**{c: pd.to_datetime(df[c]) for c in ["d1", "d2", "merge_time"]})

df.groupby(["latitude", "longitude"]).apply(
    lambda d: d.groupby(
        # start a new group whenever d1 does not follow the previous d2 by one hour
        (d["d1"] != (d["d2"].shift() + pd.Timedelta("1h"))).cumsum(), as_index=False
    ).agg({"d1": "min", "d2": "max", "ar": "sum"})
).droplevel(2, 0).reset_index()
Output:
   latitude  longitude                   d1                   d2       ar
0        15         10  1981-12-01 00:00:00  1981-12-04 03:00:00  2.31768
1        15       10.1  1981-12-01 00:00:00  1981-12-01 22:00:00  4.46188
2        15       10.1  1981-12-02 08:00:00  1981-12-02 11:00:00   2.2936
3        15       10.2  1981-12-01 00:00:00  1981-12-01 21:00:00  4.37919
4        15       10.3  1981-12-01 00:00:00  1981-12-04 02:00:00  2.27856
5        15       10.4  1981-12-01 00:00:00  1981-12-02 17:00:00  1.38442

Related

Splitting Dataframe time into morning and evening

I have a df that looks like this (shortened):
DateTime Value Date Time
0 2022-09-18 06:00:00 5.4 18/09/2022 06:00
1 2022-09-18 07:00:00 6.0 18/09/2022 07:00
2 2022-09-18 08:00:00 6.5 18/09/2022 08:00
3 2022-09-18 09:00:00 6.7 18/09/2022 09:00
8 2022-09-18 14:00:00 7.9 18/09/2022 14:00
9 2022-09-18 15:00:00 7.8 18/09/2022 15:00
10 2022-09-18 16:00:00 7.6 18/09/2022 16:00
11 2022-09-18 17:00:00 6.8 18/09/2022 17:00
12 2022-09-18 18:00:00 6.4 18/09/2022 18:00
13 2022-09-18 19:00:00 5.7 18/09/2022 19:00
14 2022-09-18 20:00:00 4.8 18/09/2022 20:00
15 2022-09-18 21:00:00 5.4 18/09/2022 21:00
16 2022-09-18 22:00:00 4.7 18/09/2022 22:00
17 2022-09-18 23:00:00 4.3 18/09/2022 23:00
18 2022-09-19 00:00:00 4.1 19/09/2022 00:00
19 2022-09-19 01:00:00 4.4 19/09/2022 01:00
22 2022-09-19 04:00:00 3.5 19/09/2022 04:00
23 2022-09-19 05:00:00 2.8 19/09/2022 05:00
24 2022-09-19 06:00:00 3.8 19/09/2022 06:00
I want to create a new column where I split the time between day and night like this:
00:00 - 05:00 night,
06:00 - 18:00 day,
19:00 - 23:00 night
But apparently one can't use the same label twice? How can I solve this? Here is my code:
df['period'] = pd.cut(pd.to_datetime(df.DateTime).dt.hour,
                      bins=[0, 5, 17, 23],
                      labels=['night', 'morning', 'night'],
                      include_lowest=True)
It's returning
ValueError: labels must be unique if ordered=True; pass ordered=False for duplicate labels
If I understood correctly: if the time is between 00:00 - 05:00 or 19:00 - 23:00, you want your new column to say 'night', else 'day'. Here's that code:
df['day/night'] = df['Time'].apply(lambda x: 'night' if '00:00' <= x <= '05:00' or '19:00' <= x <= '23:00' else 'day')
Or you can add the ordered=False parameter using your method.
input ->
df = pd.DataFrame(columns=['DateTime', 'Value', 'Date', 'Time'], data=[
['2022-09-18 06:00:00', 5.4, '18/09/2022', '06:00'],
['2022-09-18 07:00:00', 6.0, '18/09/2022', '07:00'],
['2022-09-18 08:00:00', 6.5, '18/09/2022', '08:00'],
['2022-09-18 09:00:00', 6.7, '18/09/2022', '09:00'],
['2022-09-18 14:00:00', 7.9, '18/09/2022', '14:00'],
['2022-09-18 15:00:00', 7.8, '18/09/2022', '15:00'],
['2022-09-18 16:00:00', 7.6, '18/09/2022', '16:00'],
['2022-09-18 17:00:00', 6.8, '18/09/2022', '17:00'],
['2022-09-18 18:00:00', 6.4, '18/09/2022', '18:00'],
['2022-09-18 19:00:00', 5.7, '18/09/2022', '19:00'],
['2022-09-18 20:00:00', 4.8, '18/09/2022', '20:00'],
['2022-09-18 21:00:00', 5.4, '18/09/2022', '21:00'],
['2022-09-18 22:00:00', 4.7, '18/09/2022', '22:00'],
['2022-09-18 23:00:00', 4.3, '18/09/2022', '23:00'],
['2022-09-19 00:00:00', 4.1, '19/09/2022', '00:00'],
['2022-09-19 01:00:00', 4.4, '19/09/2022', '01:00'],
['2022-09-19 04:00:00', 3.5, '19/09/2022', '04:00'],
['2022-09-19 05:00:00', 2.8, '19/09/2022', '05:00'],
['2022-09-19 06:00:00', 3.8, '19/09/2022', '06:00']])
output ->
DateTime Value Date Time day/night
0 2022-09-18 06:00:00 5.4 18/09/2022 06:00 day
1 2022-09-18 07:00:00 6.0 18/09/2022 07:00 day
2 2022-09-18 08:00:00 6.5 18/09/2022 08:00 day
3 2022-09-18 09:00:00 6.7 18/09/2022 09:00 day
4 2022-09-18 14:00:00 7.9 18/09/2022 14:00 day
5 2022-09-18 15:00:00 7.8 18/09/2022 15:00 day
6 2022-09-18 16:00:00 7.6 18/09/2022 16:00 day
7 2022-09-18 17:00:00 6.8 18/09/2022 17:00 day
8 2022-09-18 18:00:00 6.4 18/09/2022 18:00 day
9 2022-09-18 19:00:00 5.7 18/09/2022 19:00 night
10 2022-09-18 20:00:00 4.8 18/09/2022 20:00 night
11 2022-09-18 21:00:00 5.4 18/09/2022 21:00 night
12 2022-09-18 22:00:00 4.7 18/09/2022 22:00 night
13 2022-09-18 23:00:00 4.3 18/09/2022 23:00 night
14 2022-09-19 00:00:00 4.1 19/09/2022 00:00 night
15 2022-09-19 01:00:00 4.4 19/09/2022 01:00 night
16 2022-09-19 04:00:00 3.5 19/09/2022 04:00 night
17 2022-09-19 05:00:00 2.8 19/09/2022 05:00 night
18 2022-09-19 06:00:00 3.8 19/09/2022 06:00 day
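The same split can also be done without a row-wise apply. A hedged sketch using numpy on the parsed hour (assuming, as above, that hours 6-18 count as day and everything else as night):

```python
import numpy as np
import pandas as pd

dt = pd.to_datetime(pd.Series([
    "2022-09-18 06:00:00", "2022-09-18 18:00:00",
    "2022-09-18 19:00:00", "2022-09-19 00:00:00",
]))
# between(6, 18) is inclusive on both ends, matching the 06:00-18:00 day window
period = np.where(dt.dt.hour.between(6, 18), "day", "night")
```

This vectorized form avoids calling a Python lambda per row, which matters on large frames.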
You have two options.
Either you don't care about the order, in which case you can set ordered=False as a parameter of cut:
df['period'] = pd.cut(pd.to_datetime(df.DateTime).dt.hour,
                      bins=[0, 5, 17, 23],
                      labels=['night', 'morning', 'night'],
                      ordered=False,
                      include_lowest=True)
Or you care to have night and morning ordered, in which case you can further convert to an ordered Categorical:
df['period'] = pd.Categorical(df['period'], categories=['night', 'morning'], ordered=True)
output:
DateTime Value Date Time period
0 2022-09-18 06:00:00 5.4 18/09/2022 06:00 morning
1 2022-09-18 07:00:00 6.0 18/09/2022 07:00 morning
2 2022-09-18 08:00:00 6.5 18/09/2022 08:00 morning
3 2022-09-18 09:00:00 6.7 18/09/2022 09:00 morning
8 2022-09-18 14:00:00 7.9 18/09/2022 14:00 morning
9 2022-09-18 15:00:00 7.8 18/09/2022 15:00 morning
10 2022-09-18 16:00:00 7.6 18/09/2022 16:00 morning
11 2022-09-18 17:00:00 6.8 18/09/2022 17:00 morning
12 2022-09-18 18:00:00 6.4 18/09/2022 18:00 night
13 2022-09-18 19:00:00 5.7 18/09/2022 19:00 night
14 2022-09-18 20:00:00 4.8 18/09/2022 20:00 night
15 2022-09-18 21:00:00 5.4 18/09/2022 21:00 night
16 2022-09-18 22:00:00 4.7 18/09/2022 22:00 night
17 2022-09-18 23:00:00 4.3 18/09/2022 23:00 night
18 2022-09-19 00:00:00 4.1 19/09/2022 00:00 night
19 2022-09-19 01:00:00 4.4 19/09/2022 01:00 night
22 2022-09-19 04:00:00 3.5 19/09/2022 04:00 night
23 2022-09-19 05:00:00 2.8 19/09/2022 05:00 night
24 2022-09-19 06:00:00 3.8 19/09/2022 06:00 morning
column:
df['period']
0 morning
1 morning
2 morning
...
23 night
24 morning
Name: period, dtype: category
Categories (2, object): ['morning', 'night']
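A short illustration of what the ordered conversion buys (a sketch reusing the answer's two labels): comparisons and min/max become meaningful on the column.

```python
import pandas as pd

s = pd.Series(pd.Categorical(
    ["morning", "night", "night", "morning"],
    categories=["night", "morning"],  # 'night' < 'morning' in this ordering
    ordered=True,
))
after_night = s > "night"   # element-wise comparison against a category
earliest = s.min()          # the smallest category present
```

Without ordered=True, both operations would raise a TypeError.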

Resample a time-series data at the end of the month and at the end of the day

I have time-series data in the following format.
DateShort (%d/%m/%Y)  TimeFrom  TimeTo  Value
1/1/2018              0:00      1:00    6414
1/1/2018              1:00      2:00    6153
...                   ...       ...     ...
1/1/2018              23:00     0:00    6317
2/1/2018              0:00      1:00    6046
...                   ...       ...     ...
I would like to re-sample data at the end of the month and at the end of the day.
The dataset could be retrieved from https://pastebin.com/raw/NWdigN97
pandas.DataFrame.resample() provides the 'M' rule to retrieve data at the end of the month, but only at the beginning of the day.
See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
Do you have a better solution to accomplish this?
I have the following sample code:
import numpy as np
import pandas as pd
ds_url = 'https://pastebin.com/raw/NWdigN97'
df = pd.read_csv(ds_url, header=0)
df['DateTime'] = pd.to_datetime(
    df['DateShort'] + ' ' + df['TimeFrom'],
    format='%d/%m/%Y %H:%M'
)
df.drop('DateShort', axis=1, inplace=True)
df.set_index('DateTime', inplace=True)
df.resample('M').asfreq()
The output is
TimeFrom TimeTo Value
DateTime
2018-01-31 0:00 1:00 7215
2018-02-28 0:00 1:00 8580
2018-03-31 0:00 1:00 6202
2018-04-30 0:00 1:00 5369
2018-05-31 0:00 1:00 5840
2018-06-30 0:00 1:00 5730
2018-07-31 0:00 1:00 5979
2018-08-31 0:00 1:00 6009
2018-09-30 0:00 1:00 5430
2018-10-31 0:00 1:00 6587
2018-11-30 0:00 1:00 7948
2018-12-31 0:00 1:00 6193
However, the correct output should be
TimeFrom TimeTo Value
DateTime
2018-01-31 23:00 0:00 7605
2018-02-28 23:00 0:00 8790
2018-03-31 23:00 0:00 5967
2018-04-30 23:00 0:00 5595
2018-05-31 23:00 0:00 5558
2018-06-30 23:00 0:00 5153
2018-07-31 23:00 0:00 5996
2018-08-31 23:00 0:00 5757
2018-09-30 23:00 0:00 5785
2018-10-31 23:00 0:00 6437
2018-11-30 23:00 0:00 7830
2018-12-31 23:00 0:00 6767
Try this:
df.groupby(pd.Grouper(freq='M')).last()
Output:
TimeFrom TimeTo Value
DateTime
2018-01-31 23:00 0:00 7605
2018-02-28 23:00 0:00 8790
2018-03-31 23:00 0:00 5967
2018-04-30 23:00 0:00 5595
2018-05-31 23:00 0:00 5558
2018-06-30 23:00 0:00 5153
2018-07-31 23:00 0:00 5996
2018-08-31 23:00 0:00 5757
2018-09-30 23:00 0:00 5785
2018-10-31 23:00 0:00 6437
2018-11-30 23:00 0:00 7830
2018-12-31 23:00 0:00 6767
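For reference, df.resample(...).last() should give the same result as the Grouper version. A sketch on synthetic hourly data (note: the month-end alias was renamed from "M" to "ME" in pandas 2.2, so the snippet picks whichever the installed version expects):

```python
import pandas as pd

# pick the month-end alias that matches the installed pandas version
major, minor = (int(v) for v in pd.__version__.split(".")[:2])
freq = "ME" if (major, minor) >= (2, 2) else "M"

# 59 days of hourly rows: exactly January + February 2018
idx = pd.date_range("2018-01-01", periods=24 * 59, freq="h")
df = pd.DataFrame({"Value": range(len(idx))}, index=idx)

by_resample = df.resample(freq).last()
by_grouper = df.groupby(pd.Grouper(freq=freq)).last()
```

Both produce one row per month, stamped at month end and carrying the last hourly record of that month.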

performance issues with pandas apply

I have the following dataframe df
time u10 ... av kont
latitude longitude ...
51.799999 -3.2 2011-01-07 09:00:00 -2.217477 ... 0.008106 None
-3.1 2011-01-07 09:00:00 -2.137205 ... 0.008202 None
51.900002 -3.1 2011-01-07 09:00:00 -2.276076 ... 0.008310 None
-3.1 2011-01-07 10:00:00 -1.548405 ... 0.006344 None
-3.0 2011-01-07 09:00:00 -2.200620 ... 0.008537 None
52.200001 -3.9 2011-01-05 23:00:00 1.393586 ... 0.005413 None
-3.8 2011-01-05 21:00:00 1.972752 ... 0.007624 None
-3.8 2011-01-05 22:00:00 1.732336 ... 0.006696 None
-3.8 2011-01-05 23:00:00 1.551723 ... 0.005837 None
-3.8 2011-01-06 00:00:00 1.377130 ... 0.004979 None
-3.7 2011-01-05 21:00:00 2.124066 ... 0.008008 None
-3.7 2011-01-05 22:00:00 1.892480 ... 0.007125 None
-3.7 2011-01-05 23:00:00 1.710662 ... 0.006296 None
-3.6 2011-01-05 21:00:00 2.259727 ... 0.008230 None
-3.6 2011-01-05 22:00:00 2.044596 ... 0.007428 None
-3.6 2011-01-05 23:00:00 1.865990 ... 0.006652 None
52.299999 -3.8 2011-01-05 23:00:00 1.652063 ... 0.006964 None
The entire dataframe can be downloaded from here.
I need to aggregate within groups of latitude, longitude and kont. I am doing this with the following function through apply:
def summarize(group):
    s = group['kont'].eq('from').cumsum()
    return group.groupby(s).agg(
        t2m=('t2m', 'mean'),
        av=('av', 'sum'),
        ah=('tp', 'sum'),
        d1=('time', 'min'),
        d2=('time', 'max')
    )

df = df.groupby(['latitude', 'longitude']).apply(summarize).reset_index(level=-1, drop=True)
output is given here.
However, I need to run this on a large dataframe and it takes hours to finish, probably because of the use of apply.
Is there any pure-pandas way of speeding this up? Or any other way, e.g. dask?
You can try changing the code as follows, without using .apply():
s = df['kont'].eq('from').cumsum()
df = (df.groupby(['latitude', 'longitude', s])
        .agg(
            t2m=('t2m', 'mean'),
            av=('av', 'sum'),
            ah=('tp', 'sum'),
            d1=('time', 'min'),
            d2=('time', 'max')
        )
     ).reset_index(level=-1, drop=True)
Result (identical to running the original code with .apply()):
print(df)
t2m av ah d1 d2
latitude longitude
51.799999 -3.2 0.099451 0.008106 0.010043 1/7/2011 9:00 1/7/2011 9:00
-3.1 0.343713 0.008202 0.010375 1/7/2011 9:00 1/7/2011 9:00
51.900002 -3.1 0.097055 0.014654 0.020506 1/7/2011 10:00 1/7/2011 9:00
-3.0 0.261560 0.008537 0.010545 1/7/2011 9:00 1/7/2011 9:00
52.200001 -3.9 0.292841 0.005413 0.010704 1/5/2011 23:00 1/5/2011 23:00
-3.8 0.207666 0.025135 0.042585 1/5/2011 21:00 1/6/2011 0:00
-3.7 0.354354 0.021428 0.031826 1/5/2011 21:00 1/5/2011 23:00
-3.6 0.333602 0.022311 0.031084 1/5/2011 21:00 1/5/2011 23:00
52.299999 -3.8 0.012537 0.012992 0.024472 1/5/2011 23:00 1/6/2011 0:00
-3.7 -0.146262 0.030848 0.047126 1/5/2011 21:00 1/6/2011 0:00
-3.6 0.150072 0.031348 0.044772 1/5/2011 21:00 1/6/2011 0:00
52.400002 -3.8 0.240045 0.007225 0.013877 1/6/2011 0:00 1/6/2011 0:00
-3.7 0.286981 0.015497 0.025990 1/5/2011 23:00 1/6/2011 0:00
-3.6 0.167067 0.024722 0.036369 1/5/2011 22:00 1/6/2011 0:00
-3.5 0.199080 0.024500 0.033631 1/5/2011 22:00 1/6/2011 0:00
-3.4 0.258915 0.024050 0.030358 1/5/2011 22:00 1/6/2011 0:00
-2.8 0.359186 0.009324 0.010351 1/7/2011 11:00 1/7/2011 11:00
-2.7 0.241022 0.011714 0.010068 1/7/2011 10:00 1/7/2011 10:00
52.700001 -2.8 0.378778 0.009083 0.010874 1/6/2011 0:00 1/6/2011 0:00
-2.7 0.314325 0.019510 0.022723 1/5/2011 23:00 1/6/2011 0:00
52.799999 -3.7 0.214777 0.007146 0.011296 1/6/2011 0:00 1/6/2011 0:00
-3.6 0.294733 0.007325 0.010927 1/6/2011 0:00 1/6/2011 0:00
-3.6 0.300104 0.005927 0.010070 1/7/2011 17:00 1/7/2011 17:00
-3.5 0.314325 0.007460 0.010498 1/6/2011 0:00 1/6/2011 0:00
-3.5 0.271021 0.005504 0.010115 1/7/2011 17:00 1/7/2011 17:00
52.900002 -3.9 0.204980 0.006496 0.011364 1/6/2011 0:00 1/6/2011 0:00
-3.8 0.378778 0.006653 0.011136 1/6/2011 0:00 1/6/2011 0:00
-3.6 0.370264 0.005485 0.010155 1/7/2011 18:00 1/7/2011 18:00
-3.5 0.269434 0.007051 0.010269 1/6/2011 0:00 1/6/2011 0:00
-3.5 0.372156 0.005216 0.010152 1/7/2011 18:00 1/7/2011 18:00
53.000000 -3.9 0.050775 0.006166 0.010510 1/6/2011 0:00 1/6/2011 0:00
53.200001 -1.9 0.396478 0.017476 0.012246 1/5/2011 23:00 1/5/2011 23:00
54.200001 -2.3 0.380670 0.014101 0.010786 1/6/2011 0:00 1/6/2011 0:00
54.299999 -2.4 0.183496 0.011351 0.010115 1/6/2011 0:00 1/6/2011 0:00
-2.3 0.122034 0.025713 0.020119 1/5/2011 23:00 1/6/2011 0:00
Performance Comparison:
Original code using .apply():
%%timeit
def summarize(group):
    s = group['kont'].eq('from').cumsum()
    return group.groupby(s).agg(
        t2m=('t2m', 'mean'),
        av=('av', 'sum'),
        ah=('tp', 'sum'),
        d1=('time', 'min'),
        d2=('time', 'max')
    )

df.groupby(['latitude', 'longitude']).apply(summarize).reset_index(level=-1, drop=True)
303 ms ± 33.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Modified code without using .apply():
%%timeit
s = df['kont'].eq('from').cumsum()
(df.groupby(['latitude', 'longitude', s])
   .agg(
       t2m=('t2m', 'mean'),
       av=('av', 'sum'),
       ah=('tp', 'sum'),
       d1=('time', 'min'),
       d2=('time', 'max')
   )
).reset_index(level=-1, drop=True)
15.8 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
303ms vs 15.8ms: ~ 19.2 times faster
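The key feature the fast version relies on is that groupby accepts a mix of column labels and an index-aligned Series as keys, so the section id can be computed once over the whole frame instead of per group. A minimal sketch (hypothetical toy data, with 'from' marking the start of each section as in the question):

```python
import pandas as pd

df = pd.DataFrame({
    "lat": ["a", "a", "b"],
    "kont": ["from", None, "from"],
    "v": [1, 2, 3],
})
# section id bumps at every 'from' row: [1, 1, 2]
s = df["kont"].eq("from").cumsum().rename("section")
out = df.groupby(["lat", s])["v"].sum()
```

Because s was computed over the full frame, no per-group Python function is needed.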

Convert 'object' column to datetime

I currently have the following dataframe (with seven days, one day displayed below). Hours run from 01:00 to 24:00. How do I convert the HourEnding column to datetime format and combine it with the date_time column (which is already in datetime format)?
HourEnding LMP date_time
0 01:00 165.27 2021-02-20
1 02:00 155.89 2021-02-20
2 03:00 154.50 2021-02-20
3 04:00 153.44 2021-02-20
4 05:00 210.15 2021-02-20
5 06:00 298.90 2021-02-20
6 07:00 152.71 2021-02-20
7 08:00 204.61 2021-02-20
8 09:00 155.77 2021-02-20
9 10:00 90.64 2021-02-20
10 11:00 57.17 2021-02-20
11 12:00 43.74 2021-02-20
12 13:00 33.42 2021-02-20
13 14:00 5.05 2021-02-20
14 15:00 1.43 2021-02-20
15 16:00 0.99 2021-02-20
16 17:00 0.94 2021-02-20
17 18:00 12.13 2021-02-20
18 19:00 18.90 2021-02-20
19 20:00 19.04 2021-02-20
20 21:00 16.42 2021-02-20
21 22:00 14.47 2021-02-20
22 23:00 44.55 2021-02-20
23 24:00 40.51 2021-02-20
So far I've tried
df['time'] = pd.to_datetime(df['HourEnding'])
but that seems to fail because of the 24:00.
Similarly
df['time'] = pd.to_timedelta('HourEnding', 'h', errors = 'coerce')
yields a column of NaTs.
As you mentioned in the comments, hour 24 corresponds to midnight of the same day. I would simply start by replacing "24" by "00" :
df['HourEnding'] = df.HourEnding.str.replace('24:00', '00:00')
Then, convert date_time to string :
df['date_time'] = df.date_time.astype(str)
Create a new column that concatenates date_time and HourEnding :
df['date_and_hour'] = df.date_time + " " + df.HourEnding
df['date_and_hour'] = pd.to_datetime(df.date_and_hour)
Which gives you this :
>>> df
HourEnding LMP date_time date_and_hour
0 01:00 165.27 2021-02-20 2021-02-20 01:00:00
1 02:00 155.89 2021-02-20 2021-02-20 02:00:00
2 03:00 154.50 2021-02-20 2021-02-20 03:00:00
3 04:00 153.44 2021-02-20 2021-02-20 04:00:00
4 05:00 210.15 2021-02-20 2021-02-20 05:00:00
5 06:00 298.90 2021-02-20 2021-02-20 06:00:00
6 07:00 152.71 2021-02-20 2021-02-20 07:00:00
7 08:00 204.61 2021-02-20 2021-02-20 08:00:00
8 09:00 155.77 2021-02-20 2021-02-20 09:00:00
9 10:00 90.64 2021-02-20 2021-02-20 10:00:00
10 11:00 57.17 2021-02-20 2021-02-20 11:00:00
11 12:00 43.74 2021-02-20 2021-02-20 12:00:00
12 13:00 33.42 2021-02-20 2021-02-20 13:00:00
13 14:00 5.05 2021-02-20 2021-02-20 14:00:00
14 15:00 1.43 2021-02-20 2021-02-20 15:00:00
15 16:00 0.99 2021-02-20 2021-02-20 16:00:00
16 17:00 0.94 2021-02-20 2021-02-20 17:00:00
17 18:00 12.13 2021-02-20 2021-02-20 18:00:00
18 19:00 18.90 2021-02-20 2021-02-20 19:00:00
19 20:00 19.04 2021-02-20 2021-02-20 20:00:00
20 21:00 16.42 2021-02-20 2021-02-20 21:00:00
21 22:00 14.47 2021-02-20 2021-02-20 22:00:00
22 23:00 44.55 2021-02-20 2021-02-20 23:00:00
23 00:00 40.51 2021-02-20 2021-02-20 00:00:00
>>> df.dtypes
HourEnding object
LMP float64
date_time object
date_and_hour datetime64[ns]
Convert both columns to strings, then join them into a new 'datetime' column, and finally convert the 'datetime' column to datetime.
EDIT: To deal with the 1-24 hour problem, build a function to split the string and subtract 1 from each of the hours and then join:
def subtract_hour(t):
    t = t.split(':')
    t[0] = str(int(t[0]) - 1)
    if len(t[0]) < 2:
        t[0] = '0' + t[0]
    return ':'.join(t)
Then you can apply this to your hour column (e.g., df['hour'] = df['hour'].apply(subtract_hour)) and proceed with joining columns and then parsing using pd.to_datetime.
EDIT 2: You just want to change '24' to '00', my bad.
def mod_midnight(t):
    t = t.split(':')
    if t[0] == '24':
        t[0] = '00'
    return ':'.join(t)
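Both edits can also be collapsed into a single vectorized expression, avoiding apply entirely. A sketch (it assumes, per the question's comments, that hour 24 means midnight of the same day):

```python
import pandas as pd

df = pd.DataFrame({
    "HourEnding": ["01:00", "23:00", "24:00"],
    "date_time": pd.to_datetime(["2021-02-20"] * 3),
})
# replace the out-of-range hour, then parse date + time in one pass
df["date_and_hour"] = pd.to_datetime(
    df["date_time"].dt.strftime("%Y-%m-%d")
    + " "
    + df["HourEnding"].str.replace("24:00", "00:00")
)
```

The string replacement runs over the whole column at once, so this scales to many days of data.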

Create a new variable which is weekly min/max values of another variable in python

How can I assign the value 1 to variable S when the value of column A is the highest within the week? And how can I assign the value 2 to S when the value of B is the minimum within the week? I am working with hourly data, indexed by datetime.
Here is what my dataframe looks like:
A B S
datetime
6/14/2004 1:00 384.5 383.6 0
6/14/2004 2:00 384.3 382.3 0
6/14/2004 3:00 383.3 382.3 0
6/14/2004 4:00 383.3 382.6 0
6/14/2004 5:00 383.3 382.8 0
6/14/2004 6:00 383.3 382.5 0
6/14/2004 7:00 383.3 382.3 0
6/14/2004 8:00 383.8 382.3 0
6/14/2004 9:00 382.8 382.1 0
6/14/2004 10:00 382.6 382.1 0
I have tried resampling weekly and getting the max value, but I do not know how to code this, as it got more complicated than I initially thought.
Here is how I would like my final data to look:
A B S
datetime
6/14/2004 1:00 384.5 383.6 0
6/14/2004 2:00 384.3 382.3 0
6/14/2004 3:00 383.3 382.3 0
6/14/2004 4:00 383.3 382.6 0
6/14/2004 5:00 383.3 382.8 0
6/14/2004 6:00 383.3 382.5 0
6/14/2004 7:00 383.3 382.3 0
6/14/2004 8:00 383.8 382.3 0
6/14/2004 9:00 382.8 382.1 0
6/14/2004 10:00 382.6 382.1 0
6/14/2004 11:00 382.5 381.8 0
6/14/2004 12:00 382.8 382.3 0
6/14/2004 13:00 383.1 382.3 0
6/14/2004 14:00 385.8 382.5 0
6/14/2004 15:00 385.1 383.6 0
6/14/2004 16:00 384.8 383.5 0
6/14/2004 17:00 384.8 382.5 0
6/14/2004 18:00 383.6 382.8 0
6/14/2004 19:00 383.8 382.8 0
6/14/2004 20:00 383.3 382.8 0
6/14/2004 21:00 383.1 382.6 0
6/14/2004 22:00 383.1 382.6 0
6/14/2004 23:00 383.1 382.6 0
6/15/2004 0:00 382.8 382.6 0
6/15/2004 1:00 383.3 382.6 0
6/15/2004 2:00 383.6 382.3 0
6/15/2004 3:00 383.8 382.5 0
6/15/2004 4:00 382.8 382.1 0
6/15/2004 5:00 383.0 382.1 0
6/15/2004 6:00 382.8 382.0 0
... ... ... ...
6/24/2004 20:00 402.8 401.8 0
6/24/2004 21:00 402.3 401.8 0
6/24/2004 22:00 402.3 401.8 0
6/24/2004 23:00 402.1 401.1 0
6/25/2004 0:00 402.1 401.8 0
6/25/2004 1:00 402.1 401.3 0
6/25/2004 2:00 402.1 400.1 0
6/25/2004 3:00 401.6 400.8 0
6/25/2004 4:00 401.5 400.8 0
6/25/2004 5:00 401.3 400.8 0
6/25/2004 6:00 401.1 400.6 0
6/25/2004 7:00 402.1 400.8 0
6/25/2004 8:00 402.1 400.6 0
6/25/2004 9:00 401.6 400.5 0
6/25/2004 10:00 401.8 400.8 0
6/25/2004 11:00 401.5 400.6 0
6/25/2004 12:00 401.3 400.1 0
6/25/2004 13:00 402.8 401.3 0
6/25/2004 14:00 402.8 401.0 **1**
6/25/2004 15:00 401.5 400.1 0
6/25/2004 16:00 401.6 400.6 0
6/25/2004 17:00 401.8 401.0 0
6/25/2004 18:00 402.1 400.8 0
6/25/2004 19:00 402.3 400.8 0
6/25/2004 20:00 402.6 401.6 0
6/25/2004 21:00 401.8 401.3 0
6/25/2004 22:00 401.8 400.6 0
6/28/2004 0:00 401.8 401.6 0
6/28/2004 1:00 402.3 401.6 0
6/28/2004 2:00 402.3 401.5 0
For the first week, column S would have value 1 in 6/18/2004 18:00 and value 2 in 6/15/2004 11:00
For the second week, columns S would have value 1 in 6/25/2004 14:00 and value 2 in 6/21/2004 18:00
I figured out four rules:
1. When A = max(A) within the current week, put value 1 in S. If the A maximum is not unique within the week, put 1 in S at the last occurrence of the maximum in A.
2. When B = min(B) within the current week, put value 2 in S. If the B minimum is not unique within the week, put 2 in S at the last occurrence of the minimum in B.
3. Repeat this over all weeks. The entire dataset may have 80k+ hourly data rows.
4. Within each week: if max(A) and min(B) occur at the same datetime index, leave the value 0 in S (no change).
Here is the code to read the data:
import pandas as pd
url = 'https://www.dropbox.com/s/x7wl75rkzsqgkoj/dataset.csv?dl=1'
p = pd.read_csv(url)
p.set_index('datetime', drop=True, inplace=True)
p
And here is a picture explaining how I want the output to look:
So I reduced the size of the dataframe so we can see something, and I added a week column ("w") so we can check more easily.
First of all, you need to make your index a datetime object so you can access date properties, such as the week, to group by:
p.index = pd.to_datetime(p.index)
p["w"] = p.index.isocalendar().week  # DatetimeIndex.week was removed in pandas 2.0
p
A B S w
datetime
2004-06-14 01:00:00 384.5 383.6 0 25
2004-06-14 09:00:00 382.8 382.1 0 25
2004-06-14 17:00:00 384.8 382.5 0 25
2004-06-15 01:00:00 383.3 382.6 0 25
2004-06-15 09:00:00 382.3 381.6 0 25
2004-06-15 17:00:00 388.6 384.6 0 25
2004-06-16 01:00:00 387.3 387.1 0 25
2004-06-16 09:00:00 388.8 387.6 0 25
2004-06-16 17:00:00 384.5 382.6 0 25
2004-06-17 01:00:00 384.6 383.6 0 25
2004-06-17 09:00:00 385.6 384.0 0 25
2004-06-17 17:00:00 386.8 386.0 0 25
2004-06-18 01:00:00 388.6 387.3 0 25
2004-06-18 09:00:00 387.5 385.8 0 25
2004-06-18 17:00:00 395.8 394.1 0 25
2004-06-21 02:00:00 394.3 392.8 0 26
2004-06-21 10:00:00 393.3 392.3 0 26
2004-06-21 18:00:00 394.8 392.1 0 26
2004-06-22 02:00:00 394.6 393.0 0 26
2004-06-22 10:00:00 394.0 392.6 0 26
2004-06-22 18:00:00 395.3 393.8 0 26
2004-06-23 02:00:00 394.3 393.6 0 26
2004-06-23 10:00:00 395.8 395.0 0 26
2004-06-23 18:00:00 394.6 393.6 0 26
2004-06-24 02:00:00 394.6 393.1 0 26
2004-06-24 10:00:00 397.8 394.8 0 26
2004-06-24 18:00:00 401.3 400.6 0 26
2004-06-25 02:00:00 402.1 400.1 0 26
2004-06-25 10:00:00 401.8 400.8 0 26
2004-06-25 18:00:00 402.1 400.8 0 26
2004-06-28 03:00:00 402.3 401.5 0 27
2004-06-28 11:00:00 402.1 400.8 0 27
2004-06-28 19:00:00 400.3 399.1 0 27
2004-06-29 03:00:00 399.6 399.1 0 27
2004-06-29 11:00:00 397.1 395.3 0 27
2004-06-29 19:00:00 392.3 391.0 0 27
2004-06-30 03:00:00 392.3 391.8 0 27
2004-06-30 11:00:00 393.6 393.1 0 27
2004-06-30 19:00:00 393.5 391.3 0 27
Then you need to define the function that you will apply to each week:
def minmax(grp):
    # reverse the Series since we want the last occurrence; idxmax/idxmin return the first in case of a tie
    Amax = grp.A[::-1].idxmax()
    grp.loc[Amax, "S"] = 1
    Bmin = grp.B[::-1].idxmin()
    if Bmin != Amax:
        grp.loc[Bmin, "S"] = 2
    else:
        grp.loc[Bmin, "S"] = 0  # no change
    return grp
and then group by week (per year, to be safe across year boundaries) and apply the function:
p.groupby([p.index.isocalendar().week, p.index.year]).apply(minmax)
A B S w
datetime
2004-06-14 01:00:00 384.5 383.6 0 25
2004-06-14 09:00:00 382.8 382.1 0 25
2004-06-14 17:00:00 384.8 382.5 0 25
2004-06-15 01:00:00 383.3 382.6 0 25
2004-06-15 09:00:00 382.3 381.6 2 25
2004-06-15 17:00:00 388.6 384.6 0 25
2004-06-16 01:00:00 387.3 387.1 0 25
2004-06-16 09:00:00 388.8 387.6 0 25
2004-06-16 17:00:00 384.5 382.6 0 25
2004-06-17 01:00:00 384.6 383.6 0 25
2004-06-17 09:00:00 385.6 384.0 0 25
2004-06-17 17:00:00 386.8 386.0 0 25
2004-06-18 01:00:00 388.6 387.3 0 25
2004-06-18 09:00:00 387.5 385.8 0 25
2004-06-18 17:00:00 395.8 394.1 1 25
2004-06-21 02:00:00 394.3 392.8 0 26
2004-06-21 10:00:00 393.3 392.3 0 26
2004-06-21 18:00:00 394.8 392.1 2 26
2004-06-22 02:00:00 394.6 393.0 0 26
2004-06-22 10:00:00 394.0 392.6 0 26
2004-06-22 18:00:00 395.3 393.8 0 26
2004-06-23 02:00:00 394.3 393.6 0 26
2004-06-23 10:00:00 395.8 395.0 0 26
2004-06-23 18:00:00 394.6 393.6 0 26
2004-06-24 02:00:00 394.6 393.1 0 26
2004-06-24 10:00:00 397.8 394.8 0 26
2004-06-24 18:00:00 401.3 400.6 0 26
2004-06-25 02:00:00 402.1 400.1 0 26
2004-06-25 10:00:00 401.8 400.8 0 26
2004-06-25 18:00:00 402.1 400.8 1 26
2004-06-28 03:00:00 402.3 401.5 1 27
2004-06-28 11:00:00 402.1 400.8 0 27
2004-06-28 19:00:00 400.3 399.1 0 27
2004-06-29 03:00:00 399.6 399.1 0 27
2004-06-29 11:00:00 397.1 395.3 0 27
2004-06-29 19:00:00 392.3 391.0 2 27
2004-06-30 03:00:00 392.3 391.8 0 27
2004-06-30 11:00:00 393.6 393.1 0 27
2004-06-30 19:00:00 393.5 391.3 0 27
HTH
Much like @jrjc's approach, but I think this can be done without some of the assignments. Let's try this:
import numpy as np

def f(x):
    x.loc[x['A'][::-1].idxmax(), 'S'] = 1
    lindx = x['B'][::-1].idxmin()
    x.loc[lindx, 'S'] = np.where(x.loc[lindx, 'S'] == 1, 0, 2)
    return x

p_out = p.groupby(pd.Grouper(freq='W')).apply(f)
Check outputs by looking only at non-zero values of S for p_out:
p_out[p_out.S.ne(0)]
Output:
A B S
datetime
2004-06-15 11:00:00 382.0 381.1 2
2004-06-18 18:00:00 395.8 394.1 1
2004-06-21 18:00:00 394.8 392.1 2
2004-06-25 14:00:00 402.8 401.0 1
2004-06-28 14:00:00 404.6 402.3 1
2004-06-29 17:00:00 394.5 390.3 2
