keytable
Out[66]:
datahora pp pres ... WeekDay Power_kW Power_kW18
Month Day Hour ...
1 3 0 2019-01-03 00:00 0.0 1027.6 ... 3 77.303046 117.774419
1 2019-01-03 01:00 0.0 1027.0 ... 3 72.319602 110.710928
2 2019-01-03 02:00 0.0 1027.0 ... 3 71.831852 106.067667
3 2019-01-03 03:00 0.0 1027.0 ... 3 69.555751 106.325955
4 2019-01-03 04:00 0.0 1027.0 ... 3 69.525780 102.855393
... ... ... ... ... ... ...
12 30 19 2019-12-30 19:00 0.0 1031.5 ... 0 72.590489 89.749535
20 2019-12-30 20:00 0.0 1032.0 ... 0 71.444516 87.691824
21 2019-12-30 21:00 0.0 1032.0 ... 0 68.940099 87.242445
22 2019-12-30 22:00 0.0 1032.0 ... 0 67.244716 83.618018
23 2019-12-30 23:00 0.0 1032.0 ... 0 68.531573 81.288847
[8637 rows x 12 columns]
I have this dataframe and I wish to go through a day's values of 'pp' (precipitation) to see if it rained at any point in the 24-hour period, by creating a column called 'rainday' that turns to 1 if a certain threshold of 'pp' is passed during the day. How can I do it?
Use groupby on the Month and Day index levels with transform('max') and compare with your threshold:
threshold = 1
df["rainday"] = (df.groupby(level=["Month", "Day"])["pp"]
                   .transform("max")
                   .gt(threshold)
                   .astype(int))
print(df)
datahora pp pres WeekDay Power_kW Power_kW18 rainday
Month Day Hour
1 3 0 2019-01-03 00:00 0.0 1027.6 3 77.303046 117.774419 0
1 2019-01-03 01:00 0.0 1027.0 3 72.319602 110.710928 0
2 2019-01-03 02:00 0.0 1027.0 3 71.831852 106.067667 0
3 2019-01-03 03:00 0.0 1027.0 3 69.555751 106.325955 0
4 2019-01-03 04:00 1.0 1027.0 3 69.525780 102.855393 0
12 30 19 2019-12-30 19:00 0.0 1031.5 0 72.590489 89.749535 1
20 2019-12-30 20:00 0.0 1032.0 0 71.444516 87.691824 1
21 2019-12-30 21:00 0.0 1032.0 0 68.940099 87.242445 1
22 2019-12-30 22:00 1.0 1032.0 0 67.244716 83.618018 1
23 2019-12-30 23:00 2.0 1032.0 0 68.531573 81.288847 1
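If the threshold is meant for the day's total rainfall rather than the wettest single hour, transform('sum') is a drop-in replacement. A minimal sketch, assuming 'pp' is an hourly accumulation:
# Flag days whose total precipitation exceeds the threshold instead of the hourly maximum
df["rainday"] = (df.groupby(level=["Month", "Day"])["pp"]
                   .transform("sum")
                   .gt(threshold)
                   .astype(int))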
My two dataframes:
wetter
Out[223]:
level_0 index TEMPERATURE:TOTAL SLP HOUR Time
0 0 2018-01-01 00:00:00 9.8 NaN 00 00:00
1 1 2018-01-01 01:00:00 9.8 NaN 01 01:00
2 2 2018-01-01 02:00:00 9.2 NaN 02 02:00
3 3 2018-01-01 03:00:00 8.4 NaN 03 03:00
4 4 2018-01-01 04:00:00 8.5 NaN 04 04:00
... ... ... ... ... ...
49034 49034 2018-12-31 22:40:00 8.5 NaN 22 22:40
49035 49035 2018-12-31 22:45:00 8.4 NaN 22 22:45
49036 49036 2018-12-31 22:50:00 8.4 NaN 22 22:50
49037 49037 2018-12-31 22:55:00 8.4 NaN 22 22:55
49038 49038 2018-12-31 23:00:00 8.4 NaN 23 23:00
[49039 rows x 6 columns]
df
Out[224]:
0 Time -14 -13 ... 17 18 NaN
1 00:00 1,256326635 1,218256131 ... 0,080348715 0,040194189 00:15
2 00:15 1,256564788 1,218487067 ... 0,080254367 0,039517006 00:30
3 00:30 1,260350982 1,222158528 ... 0,080219518 0,039054261 00:45
4 00:45 1,259306606 1,221145800 ... 0,080758578 0,039176953 01:00
5 01:00 1,258521518 1,220384502 ... 0,080444585 0,038164953 01:15
.. ... ... ... ... ... ... ...
92 22:45 1,253545107 1,215558891 ... 0,080164570 0,042697436 23:00
93 23:00 1,241253483 1,203639741 ... 0,078395829 0,039685235 23:15
94 23:15 1,242890274 1,205226933 ... 0,078801415 0,039170364 23:30
95 23:30 1,240459118 1,202869448 ... 0,079511294 0,039013684 23:45
96 23:45 1,236228281 1,198766818 ... 0,079186806 0,037570494 00:00
[96 rows x 35 columns]
I want to fill the SLP column of wetter based on TEMPERATURE:TOTAL and Time.
For this I want to look at the df dataframe and fill SLP depending on the columns of df, where the headers are temperatures.
So for the first TEMPERATURE:TOTAL of 9.8 at 00:00, SLP should be filled with the value of the column that is simply called 9, in the row where Time is 00:00.
I have tried to do this, which is why I also created the Time columns, but I am stuck. I thought of some nested loops, but knowing a bit of pandas, I guess there is probably a two-liner solution for this?
Here is one way!
import numpy as np
import pandas as pd
This is me simulating your dataframes (you are free to skip this step) - next time please provide them.
wetter = pd.DataFrame()
df = pd.DataFrame()
wetter['TEMPERATURE:TOTAL'] = np.random.rand(10) * 10
wetter['SLP'] = np.nan
wetter['Time'] = pd.date_range("00:00", periods=10, freq="H")
df['Time'] = pd.date_range("00:00", periods=10, freq="15T")
for i in range(-14, 18):
    df[i] = np.random.rand(10)
Preprocess:
# Truncate the temperature to the integer column label it should match
wetter['temp'] = np.floor(wetter['TEMPERATURE:TOTAL'])
wetter = wetter.astype({'temp': 'int'})
# No need to set 'Time' as the index; melt and the merge below both work on the column
value_vars_ = list(range(-14, 18))
df_long = pd.melt(df, id_vars='Time', value_vars=value_vars_, var_name='temp', value_name="SLP")
df_long['temp'] = df_long['temp'].astype(int)  # melted column labels come back as objects, so cast them to match the merge key
Left-join two dataframes on Time and temp:
final = pd.merge(wetter.drop('SLP', axis=1), df_long, how="left", on=["Time", "temp"])
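final now has one row per row of wetter, with the looked-up SLP. If you want the values back on the original frame, a small sketch (assuming the left join preserved wetter's row order, which it does when each (Time, temp) pair occurs at most once in df_long):
# Copy the looked-up values back into the original frame
wetter["SLP"] = final["SLP"].values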
I am struggling to find a way to divide monthly values from a df by a monthly index using Python pandas.
Let me explain:
I have a traditional time series dataset like this:
Id AAA BBBB CCCC
2017-03-31 0 0 0
2017-04-30 0 0 0
2017-05-31 0 0 0
2017-06-30 0 0 0
2017-07-31 0 0 0
2017-08-31 0 0 0
2017-09-30 0 0 0
2017-10-31 0 0 0
2017-11-30 0 0 0
2017-12-31 0 0 0
2018-01-31 1 0 0
2018-02-28 0 204 0
2018-03-31 0 0 0
2018-04-30 0 0 0
2018-05-31 80 130 0
2018-06-30 0 0 0
2018-07-31 5 252 0
2018-08-31 0 290 0
2018-09-30 0 0 0
2018-10-31 1 230 0
2018-11-30 92 60 0
2018-12-31 0 0 0
2019-01-31 0 40 0
2019-02-28 16 48 0
2019-03-31 0 0 0
2019-04-30 0 224 0
2019-05-31 30 270 0
2019-06-30 0 0 0
2019-07-31 13 284 0
2019-08-31 0 0 0
2019-09-30 0 112 0
2019-10-31 0 134 0
2019-11-30 0 0 0
2019-12-31 0 50 0
2020-01-31 0 0 0
2020-02-29 0 0 0
2020-03-31 0 0 0
2020-04-30 10 67 0
2020-05-31 0 0 0
2020-06-30 0 54 1
and I have monthly indexes:
Id AAA BBBB CCCC
1 0.055046 0.212131 0.0
2 0.880734 1.336427 0.0
3 0.000000 0.000000 0.0
4 0.412844 1.157441 0.0
5 4.541284 1.590984 0.0
6 0.000000 0.214783 12.0
7 0.990826 2.842559 0.0
8 0.000000 1.537952 0.0
9 0.000000 0.593968 0.0
10 0.055046 1.930394 0.0
11 5.064220 0.318197 0.0
12 0.000000 0.265164 0.0
The goal is to divide each month of the first dataset by its corresponding index from the second dataset.
E.g. the value for product AAA on the date 2019-06-30 should be divided by the seasonal index with Id 6.
How can this be done in pandas?
Quick and dirty: pd.merge() with left_on being the month of the datetime index, right_on being the index (Id). The element-wise quotient can be computed subsequently.
Data
import pandas as pd
import io
df1 = pd.read_csv(io.StringIO("""
Id AAA BBBB CCCC
2017-03-31 0 0 0
2017-04-30 0 0 0
2017-05-31 0 0 0
2017-06-30 0 0 0
2017-07-31 0 0 0
2017-08-31 0 0 0
2017-09-30 0 0 0
2017-10-31 0 0 0
2017-11-30 0 0 0
2017-12-31 0 0 0
2018-01-31 1 0 0
2018-02-28 0 204 0
2018-03-31 0 0 0
2018-04-30 0 0 0
2018-05-31 80 130 0
2018-06-30 0 0 0
2018-07-31 5 252 0
2018-08-31 0 290 0
2018-09-30 0 0 0
2018-10-31 1 230 0
2018-11-30 92 60 0
2018-12-31 0 0 0
2019-01-31 0 40 0
2019-02-28 16 48 0
2019-03-31 0 0 0
2019-04-30 0 224 0
2019-05-31 30 270 0
2019-06-30 0 0 0
2019-07-31 13 284 0
2019-08-31 0 0 0
2019-09-30 0 112 0
2019-10-31 0 134 0
2019-11-30 0 0 0
2019-12-31 0 50 0
2020-01-31 0 0 0
2020-02-29 0 0 0
2020-03-31 0 0 0
2020-04-30 10 67 0
2020-05-31 0 0 0
2020-06-30 0 54 1
"""), sep=r"\s{2,}", engine="python")
df1["Id"] = pd.to_datetime(df1["Id"])
df1.set_index("Id", inplace=True)
df2 = pd.read_csv(io.StringIO("""
Id AAA BBBB CCCC
1 0.055046 0.212131 0.0
2 0.880734 1.336427 0.0
3 0.000000 0.000000 0.0
4 0.412844 1.157441 0.0
5 4.541284 1.590984 0.0
6 0.000000 0.214783 12.0
7 0.990826 2.842559 0.0
8 0.000000 1.537952 0.0
9 0.000000 0.593968 0.0
10 0.055046 1.930394 0.0
11 5.064220 0.318197 0.0
12 0.000000 0.265164 0.0
"""), sep=r"\s{2,}", engine="python")
df2.set_index("Id", inplace=True)
Code
df_joined = df1.merge(df2, how="left", left_on=df1.index.month, right_on="Id")
for col in df1.columns: # or ["AAA", "BBBB", "CCCC"]
    df1[col] = (df_joined[f"{col}_x"] / df_joined[f"{col}_y"]).values
del df_joined
Result
print(df1)
AAA BBBB CCCC
Id
2017-03-31 NaN NaN NaN
2017-04-30 0.000000 0.000000 NaN
2017-05-31 0.000000 0.000000 NaN
2017-06-30 NaN 0.000000 0.000000
2017-07-31 0.000000 0.000000 NaN
2017-08-31 NaN 0.000000 NaN
2017-09-30 NaN 0.000000 NaN
2017-10-31 0.000000 0.000000 NaN
2017-11-30 0.000000 0.000000 NaN
2017-12-31 NaN 0.000000 NaN
2018-01-31 18.166624 0.000000 NaN
2018-02-28 0.000000 152.645824 NaN
2018-03-31 NaN NaN NaN
2018-04-30 0.000000 0.000000 NaN
2018-05-31 17.616163 81.710438 NaN
2018-06-30 NaN 0.000000 0.000000
2018-07-31 5.046295 88.652513 NaN
2018-08-31 NaN 188.562452 NaN
2018-09-30 NaN 0.000000 NaN
2018-10-31 18.166624 119.146661 NaN
2018-11-30 18.166667 188.562431 NaN
2018-12-31 NaN 0.000000 NaN
2019-01-31 0.000000 188.562728 NaN
2019-02-28 18.166666 35.916664 NaN
2019-03-31 NaN NaN NaN
2019-04-30 0.000000 193.530383 NaN
2019-05-31 6.606061 169.706295 NaN
2019-06-30 NaN 0.000000 0.000000
2019-07-31 13.120366 99.909975 NaN
2019-08-31 NaN 0.000000 NaN
2019-09-30 NaN 188.562347 NaN
2019-10-31 0.000000 69.415881 NaN
2019-11-30 0.000000 0.000000 NaN
2019-12-31 NaN 188.562550 NaN
2020-01-31 0.000000 0.000000 NaN
2020-02-29 0.000000 0.000000 NaN
2020-03-31 NaN NaN NaN
2020-04-30 24.222224 57.886320 NaN
2020-05-31 0.000000 0.000000 NaN
2020-06-30 NaN 251.416546 0.083333
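A merge-free sketch of the same idea: look the monthly factors up by calendar month with reindex, re-align them to df1's dates, and divide element-wise.
factors = df2.reindex(df1.index.month)  # one row of monthly indexes per row of df1
factors.index = df1.index               # align the factors back to the original dates
result = df1 / factors                  # element-wise quotient, as above
print(result)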
Wish to have the time duration / accumulated time diff for as long as "state" == 1 is active, and 'off' otherwise.
timestamp state
2020-01-01 00:00:00 0
2020-01-01 00:00:01 0
2020-01-01 00:00:02 0
2020-01-01 00:00:03 1
2020-01-01 00:00:04 1
2020-01-01 00:00:05 1
2020-01-01 00:00:06 1
2020-01-01 00:00:07 0
2020-01-01 00:00:08 0
2020-01-01 00:00:09 0
2020-01-01 00:00:10 0
2020-01-01 00:00:11 1
2020-01-01 00:00:12 1
2020-01-01 00:00:13 1
2020-01-01 00:00:14 1
2020-01-01 00:00:15 1
2020-01-01 00:00:16 1
2020-01-01 00:00:17 0
2020-01-01 00:00:18 0
2020-01-01 00:00:19 0
2020-01-01 00:00:20 0
Based on a similar question, I tried something with groupby; however, the code does not stop computing the time diff when "state" == 0.
I also tried to apply a lambda function (commented out below), but an error pops up saying "KeyError: ('state', 'occurred at index timestamp')".
Any idea how to do this better?
import numpy as np
import pandas as pd
dt = pd.date_range('2020-01-01', '2020-01-01 00:00:20', freq='1s')
s = [0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,0,0,0,0]
df = pd.DataFrame({'timestamp': dt,
                   'state': s})
df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d %H:%M:%S')
df['tdiff'] = (df.groupby('state').diff().timestamp.values / 60)
#df['tdiff'] = df.apply(lambda x: x['timestamp'].diff().state.values/60 if x['state'] == 1 else 'off')
The desired output should be:
timestamp state tdiff accum.
2020-01-01 00:00:00 0 off 0
2020-01-01 00:00:01 0 off 0
2020-01-01 00:00:02 0 off 0
2020-01-01 00:00:03 1 nan 0
2020-01-01 00:00:04 1 1.0 1.0
2020-01-01 00:00:05 1 1.0 2.0
2020-01-01 00:00:06 1 1.0 3.0
2020-01-01 00:00:07 0 off 0
2020-01-01 00:00:08 0 off 0
2020-01-01 00:00:09 0 off 0
2020-01-01 00:00:10 0 off 0
2020-01-01 00:00:11 1 nan 0
2020-01-01 00:00:12 1 1.0 1.0
2020-01-01 00:00:13 1 1.0 2.0
2020-01-01 00:00:14 1 1.0 3.0
2020-01-01 00:00:15 1 1.0 4.0
2020-01-01 00:00:16 1 1.0 5.0
You can use groupby with cumsum to build the additional group key:
g = df.loc[df['state'].ne(0)].groupby(df['state'].eq(0).cumsum())['timestamp']
s1 = g.diff().dt.total_seconds()
s2 = g.apply(lambda x : x.diff().dt.total_seconds().cumsum())
df['tdiff'] = 'off'
df.loc[df['state'].ne(0),'tdiff'] = s1
df['accum'] = s2
# notice I did not fillna with 0, you can do it with df['accum'].fillna(0,inplace=True)
df
Out[53]:
timestamp state tdiff accum
0 2020-01-01 00:00:00 0 off NaN
1 2020-01-01 00:00:01 0 off NaN
2 2020-01-01 00:00:02 0 off NaN
3 2020-01-01 00:00:03 1 NaN NaN
4 2020-01-01 00:00:04 1 1 1.0
5 2020-01-01 00:00:05 1 1 2.0
6 2020-01-01 00:00:06 1 1 3.0
7 2020-01-01 00:00:07 0 off NaN
8 2020-01-01 00:00:08 0 off NaN
9 2020-01-01 00:00:09 0 off NaN
10 2020-01-01 00:00:10 0 off NaN
11 2020-01-01 00:00:11 1 NaN NaN
12 2020-01-01 00:00:12 1 1 1.0
13 2020-01-01 00:00:13 1 1 2.0
14 2020-01-01 00:00:14 1 1 3.0
15 2020-01-01 00:00:15 1 1 4.0
16 2020-01-01 00:00:16 1 1 5.0
17 2020-01-01 00:00:17 0 off NaN
18 2020-01-01 00:00:18 0 off NaN
19 2020-01-01 00:00:19 0 off NaN
20 2020-01-01 00:00:20 0 off NaN
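The apply with the lambda can also be avoided: within each "on" block the running total is just the cumulative sum of the per-row differences already computed as s1. A sketch reusing the names from the answer above:
on = df['state'].ne(0)
block = df['state'].eq(0).cumsum()
df['accum'] = s1.groupby(block[on]).cumsum()  # same values as s2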
I have a dataframe like this:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2023 1 2
.......
I would like to split the rows where end - start > 1 year (see last row where end=2023 and start = 2020), keeping the same value for column A, while splitting proportionally the value in column B:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2020 1 2/4
01.01.2021 31.12.2021 1 2/4
01.01.2022 31.12.2022 1 2/4
01.01.2023 31.12.2023 1 2/4
.......
Any idea?
Here is my solution. See the comments below:
import io
import pandas as pd
# TEST DATA:
text=""" start end A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
31.12.2020 20.01.2021 12 12
31.12.2020 01.01.2021 22 22
30.12.2020 01.01.2021 32 32
10.05.2020 28.09.2023 44 44
27.11.2020 31.12.2023 88 88
31.12.2020 31.12.2023 100 100
01.01.2020 31.12.2021 200 200
"""
df= pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python", parse_dates=[0,1])
#print("\n----\n df:",df)
#----------------------------------------
# SOLUTION:
def split_years(r):
    """
    Split row 'r' where the year of "end" is greater than the year of "start".
    The new rows have repeated values of 'A', and 'B' divided by the number of years.
    Return: a DataFrame with one row per year.
    """
    t1, t2 = r["start"], r["end"]
    ys = t2.year - t1.year
    kk = 0 if t1.is_year_end else 1
    if ys > 0:
        l1 = [t1] + [t1 + pd.offsets.YearBegin(i) for i in range(1, ys + 1)]
        l2 = [t1 + pd.offsets.YearEnd(i) for i in range(kk, ys + kk)] + [t2]
        return pd.DataFrame({"start": l1, "end": l2, "A": r.A, "B": r.B / len(l1)})
    print("year difference <= 0!")
    return None
# Create two groups, one for rows where the 'start' and 'end' is in the same year, and one for the others:
grps= df.groupby(lambda idx: (df.loc[idx,"start"].year-df.loc[idx,"end"].year)!=0 ).groups
print("\n---- grps:\n",grps)
# Extract the "one year" rows in a data frame:
df1= df.loc[grps[False]]
#print("\n---- df1:\n",df1)
# Extract the rows to be splitted:
df2= df.loc[grps[True]]
print("\n---- df2:\n",df2)
# Split the rows and put the resulting data frames into a list:
ldfs=[ split_years(df2.loc[row]) for row in df2.index ]
print("\n---- ldfs:")
for fr in ldfs:
    print(fr, "\n")
# Insert the "one year" data frame to the list, and concatenate them:
ldfs.insert(0,df1)
df_rslt= pd.concat(ldfs,sort=False)
#print("\n---- df_rslt:\n",df_rslt)
# Housekeeping:
df_rslt= df_rslt.sort_values("start").reset_index(drop=True)
print("\n---- df_rslt:\n",df_rslt)
Outputs:
---- grps:
{False: Int64Index([0, 1, 2, 3, 4], dtype='int64'), True: Int64Index([5, 6, 7, 8, 9, 10, 11], dtype='int64')}
---- df2:
start end A B
5 2020-12-31 2021-01-20 12 12
6 2020-12-31 2021-01-01 22 22
7 2020-12-30 2021-01-01 32 32
8 2020-10-05 2023-09-28 44 44
9 2020-11-27 2023-12-31 88 88
10 2020-12-31 2023-12-31 100 100
11 2020-01-01 2021-12-31 200 200
---- ldfs:
start end A B
0 2020-12-31 2020-12-31 12 6.0
1 2021-01-01 2021-01-20 12 6.0
start end A B
0 2020-12-31 2020-12-31 22 11.0
1 2021-01-01 2021-01-01 22 11.0
start end A B
0 2020-12-30 2020-12-31 32 16.0
1 2021-01-01 2021-01-01 32 16.0
start end A B
0 2020-10-05 2020-12-31 44 11.0
1 2021-01-01 2021-12-31 44 11.0
2 2022-01-01 2022-12-31 44 11.0
3 2023-01-01 2023-09-28 44 11.0
start end A B
0 2020-11-27 2020-12-31 88 22.0
1 2021-01-01 2021-12-31 88 22.0
2 2022-01-01 2022-12-31 88 22.0
3 2023-01-01 2023-12-31 88 22.0
start end A B
0 2020-12-31 2020-12-31 100 25.0
1 2021-01-01 2021-12-31 100 25.0
2 2022-01-01 2022-12-31 100 25.0
3 2023-01-01 2023-12-31 100 25.0
start end A B
0 2020-01-01 2020-12-31 200 100.0
1 2021-01-01 2021-12-31 200 100.0
---- df_rslt:
start end A B
0 2020-01-01 2020-06-30 2 3.0
1 2020-01-01 2020-12-31 3 1.0
2 2020-01-01 2020-12-31 200 100.0
3 2020-01-04 2020-04-30 6 2.0
4 2020-01-07 2020-12-31 8 2.0
5 2020-10-05 2020-12-31 44 11.0
6 2020-11-27 2020-12-31 88 22.0
7 2020-12-30 2020-12-31 32 16.0
8 2020-12-31 2020-12-31 12 6.0
9 2020-12-31 2020-12-31 100 25.0
10 2020-12-31 2020-12-31 22 11.0
11 2021-01-01 2021-12-31 100 25.0
12 2021-01-01 2021-12-31 88 22.0
13 2021-01-01 2021-12-31 44 11.0
14 2021-01-01 2021-01-01 32 16.0
15 2021-01-01 2021-01-01 22 11.0
16 2021-01-01 2021-01-20 12 6.0
17 2021-01-01 2021-12-31 2 3.0
18 2021-01-01 2021-12-31 200 100.0
19 2022-01-01 2022-12-31 88 22.0
20 2022-01-01 2022-12-31 100 25.0
21 2022-01-01 2022-12-31 44 11.0
22 2023-01-01 2023-09-28 44 11.0
23 2023-01-01 2023-12-31 88 22.0
24 2023-01-01 2023-12-31 100 25.0
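A quick sanity check on the split (just a sketch): since each row's B is divided evenly over its years, the concatenated result should still sum to the original totals.
assert abs(df_rslt["B"].sum() - df["B"].sum()) < 1e-9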
A bit of a different approach, adding new columns instead of new rows, but I think this accomplishes what you want to do.
import datetime as dt
import pandas as pd

df["years_apart"] = (
    (df["end_date"] - df["start_date"]).dt.days / 365
).astype(int)
for years in range(1, int(df["years_apart"].max()) + 1):  # +1 so the largest span gets a column too
    df[f"{years}_end_date"] = pd.NaT
    df.loc[
        df["years_apart"] == years, f"{years}_end_date"
    ] = df.loc[
        df["years_apart"] == years, "start_date"
    ] + dt.timedelta(days=365 * years)
df["B_bis"] = df["B"] / df["years_apart"]
Output
start_date end_date years_apart 1_end_date 2_end_date ...
2018-01-01 2018-01-02 0 NaT NaT
2018-01-02 2019-01-02 1 2019-01-02 NaT
2018-01-03 2020-01-03 2 NaT 2020-01-03
I solved it by creating a date difference and a counter that adds years to the repeated rows:
import numpy as np
from datetime import timedelta

#calculate difference between start and end year
table['diff'] = (table['end'] - table['start'])//timedelta(days=365)
table['diff'] = table['diff']+1
#replicate rows depending on number of years
table = table.reindex(table.index.repeat(table['diff']))
#counter that increases for diff>1, assigning increasing years to the replicated rows
table['count'] = table['diff'].groupby(table['diff']).cumsum()//table['diff']
table['start'] = np.where(table['diff']>1, table['start']+table['count']-1, table['start'])
table['end'] = table['start']
#split B among years
table['B'] = table['B']//table['diff']
I have two dataframes that I need to merge based on date. The first dataframe looks like:
Time Stamp HP_1H_mean Coolant1_1H_mean Extreme_1H_mean
0 2019-07-26 07:00:00 410.637966 414.607081 0.0
1 2019-07-26 08:00:00 403.521735 424.787366 0.0
2 2019-07-26 09:00:00 403.143925 425.739639 0.0
3 2019-07-26 10:00:00 410.542895 426.210538 0.0
...
17 2019-07-27 00:00:00 0.000000 0.000000 0.0
18 2019-07-27 01:00:00 0.000000 0.000000 0.0
19 2019-07-27 02:00:00 0.000000 0.000000 0.0
20 2019-07-27 03:00:00 0.000000 0.000000 0.0
The second is like this:
Time Stamp Qty Compl
0 2019-07-26 150
1 2019-07-27 20
2 2019-07-29 230
3 2019-07-30 230
4 2019-07-31 170
Both Time Stamp columns are datetime64[ns]. I wanted to merge left and forward fill the value into all the other rows for a day. My problem is that at the merge, the Qty Compl from the second df is applied only at midnight of each day, and some days do not have a midnight time stamp, such as the first day in the first dataframe.
Is there a way to merge and match every row that contains the same day? The desired output would look like this:
Time Stamp HP_1H_mean Coolant1_1H_mean Extreme_1H_mean Qty Compl
0 2019-07-26 07:00:00 410.637966 414.607081 0.0 150
1 2019-07-26 08:00:00 403.521735 424.787366 0.0 150
2 2019-07-26 09:00:00 403.143925 425.739639 0.0 150
3 2019-07-26 10:00:00 410.542895 426.210538 0.0 150
...
17 2019-07-27 00:00:00 0.000000 0.000000 0.0 20
18 2019-07-27 01:00:00 0.000000 0.000000 0.0 20
19 2019-07-27 02:00:00 0.000000 0.000000 0.0 20
20 2019-07-27 03:00:00 0.000000 0.000000 0.0 20
Use merge_asof with both DataFrames sorted by their datetimes:
#if necessary
df1['Time Stamp'] = pd.to_datetime(df1['Time Stamp'])
df2['Time Stamp'] = pd.to_datetime(df2['Time Stamp'])
df1 = df1.sort_values('Time Stamp')
df2 = df2.sort_values('Time Stamp')
df = pd.merge_asof(df1, df2, on='Time Stamp')
print(df)
Time Stamp HP_1H_mean Coolant1_1H_mean Extreme_1H_mean \
0 2019-07-26 07:00:00 410.637966 414.607081 0.0
1 2019-07-26 08:00:00 403.521735 424.787366 0.0
2 2019-07-26 09:00:00 403.143925 425.739639 0.0
3 2019-07-26 10:00:00 410.542895 426.210538 0.0
4 2019-07-27 00:00:00 0.000000 0.000000 0.0
5 2019-07-27 01:00:00 0.000000 0.000000 0.0
6 2019-07-27 02:00:00 0.000000 0.000000 0.0
7 2019-07-27 03:00:00 0.000000 0.000000 0.0
Qty Compl
0 150
1 150
2 150
3 150
4 20
5 20
6 20
7 20
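If you would rather match on the calendar day explicitly instead of relying on merge_asof picking the most recent earlier timestamp, a plain merge on a normalized day key gives the same result here. A sketch, assuming df2 holds one midnight timestamp per day:
df1['day'] = df1['Time Stamp'].dt.normalize()  # midnight of each row's day
df = (df1.merge(df2.rename(columns={'Time Stamp': 'day'}), on='day', how='left')
         .drop(columns='day'))
print(df)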