I have a df with an index extending past the last data point:
df
2022-01-31 96.210 21649.6
2022-02-28 96.390 21708.4
2022-03-31 97.410 21739.7
2022-04-30 98.630 21644.3
2022-05-31 103.744 21649.2
2022-06-30 102.498 21607.4
2022-07-31 105.138 21636.1
2022-08-31 105.450 21631.8
2022-09-30 109.691 21503.1
2022-10-31 111.745 21414.8
2022-11-30 111.481 21351.6
2022-12-31 104.728 NaN
2023-01-31 103.522 NaN
2023-02-28 NaN NaN
2023-03-31 NaN NaN
2023-04-30 NaN NaN
2023-05-31 NaN NaN
2023-06-30 NaN NaN
2023-07-31 NaN NaN
2023-08-31 NaN NaN
and when I compute pct_change, pandas treats the NaNs as values and extends the pct_change calculation past the last actual data point. So instead of stopping at 2023-01-31 for the first column, pandas continues computing values for 2023-02-28 and so on:
df.pct_change(12)
2022-01-31 0.069713 0.117543
2022-02-28 0.059464 0.106713
2022-03-31 0.069969 0.094989
2022-04-30 0.061336 0.076258
2022-05-31 0.140671 0.060263
2022-06-30 0.141022 0.056075
2022-07-31 0.135400 0.049517
2022-08-31 0.145573 0.038228
2022-09-30 0.186490 0.025632
2022-10-31 0.188271 0.012812
2022-11-30 0.187484 0.000112
2022-12-31 0.090576 -0.006440
2023-01-31 0.076000 -0.013765
2023-02-28 0.073991 -0.016436
2023-03-31 0.062745 -0.017852
2023-04-30 0.049600 -0.013523
2023-05-31 -0.002140 -0.013746
2023-06-30 0.009990 -0.011839
2023-07-31 -0.015370 -0.013149
2023-08-31 -0.018284 -0.012953
How can I tell pandas to compute pct_change only up to the NaN values, so the output is:
2022-01-31 0.069713 0.117543
2022-02-28 0.059464 0.106713
2022-03-31 0.069969 0.094989
2022-04-30 0.061336 0.076258
2022-05-31 0.140671 0.060263
2022-06-30 0.141022 0.056075
2022-07-31 0.135400 0.049517
2022-08-31 0.145573 0.038228
2022-09-30 0.186490 0.025632
2022-10-31 0.188271 0.012812
2022-11-30 0.187484 0.000112
2022-12-31 0.090576 NaN
2023-01-31 0.076000 NaN
2023-02-28 NaN NaN
2023-03-31 NaN NaN
2023-04-30 NaN NaN
2023-05-31 NaN NaN
2023-06-30 NaN NaN
2023-07-31 NaN NaN
2023-08-31 NaN NaN
The following options won't work for me:
dropna
specifying a range a-la df=df[:'2023-01-31'].pct_change(12)
I need to keep the index past the last data point, and I am doing a lot of different pct_change calls, so specifying a range every time is too time-consuming; I am looking for a nicer solution.
Also, specifying a range won't work because when the df gets new data points, I will have to manually adjust all of the ranges, which is no bueno.
If you look at the documentation for df.pct_change, you'll find that it has a parameter fill_method, which defaults to pad for handling NaN values before computing the changes. pad (or ffill) means the function propagates the last valid observation forward. E.g. in your first column, with periods=12, when the function reaches the index value 2023-02-28, instead of using the NaN value it picks the last valid value (103.522 at 2023-01-31) to do the calculation:
103.522/96.390-1
# 0.0739910779126467 (96.390 being the value at `2022-02-28`, one year earlier)
To avoid this, you want to set fill_method to 'bfill' (cf. the documentation for df.fillna for these methods). The result is exemplified below with periods=3 (I trust you are only showing the last rows of a longer df, because otherwise all the values before 2023-01-31 should be NaN):
df.pct_change(periods=3, fill_method='bfill')
col1 col2
2022-01-31 NaN NaN
2022-02-28 NaN NaN
2022-03-31 NaN NaN
2022-04-30 0.025153 -0.000245
2022-05-31 0.076294 -0.002727
2022-06-30 0.052233 -0.006086
2022-07-31 0.065984 -0.000379
2022-08-31 0.016444 -0.000804
2022-09-30 0.070177 -0.004827
2022-10-31 0.062841 -0.010228
2022-11-30 0.057193 -0.012953
2022-12-31 -0.045245 NaN
2023-01-31 -0.073587 NaN
2023-02-28 NaN NaN
2023-03-31 NaN NaN
2023-04-30 NaN NaN
2023-05-31 NaN NaN
2023-06-30 NaN NaN
2023-07-31 NaN NaN
2023-08-31 NaN NaN
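As an addendum (my suggestion, not part of the original answer): you can also pass fill_method=None, which disables the filling entirely, so any window that touches a NaN simply yields NaN — exactly the requested output. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Toy series standing in for one column of the df:
# three real observations followed by NaN padding (values made up).
s = pd.Series([100.0, 110.0, 121.0, np.nan, np.nan])

# fill_method=None disables the default forward-fill, so any window
# touching a NaN yields NaN instead of reusing the last valid value.
out = s.pct_change(periods=1, fill_method=None)
# out: [NaN, ~0.1, ~0.1, NaN, NaN]
```

This also avoids the deprecation warning newer pandas versions emit when the default padding kicks in on NaN data.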
You can pass pct_change(limit=1), which limits the NaN-filling to a single consecutive NaN, so the computation stops after one missing value.
My goal is to select the column Sabah in dataframe prdt and write each of its values into the repeated Sabah rows of dataframe prcal.
prcal
Vakit Start_Date End_Date Start_Time End_Time
0 Sabah 2022-01-01 2022-01-01 NaN NaN
1 Güneş 2022-01-01 2022-01-01 NaN NaN
2 Öğle 2022-01-01 2022-01-01 NaN NaN
3 İkindi 2022-01-01 2022-01-01 NaN NaN
4 Akşam 2022-01-01 2022-01-01 NaN NaN
..........................................................
2184 Sabah 2022-12-31 2022-12-31 NaN NaN
2185 Güneş 2022-12-31 2022-12-31 NaN NaN
2186 Öğle 2022-12-31 2022-12-31 NaN NaN
2187 İkindi 2022-12-31 2022-12-31 NaN NaN
2188 Akşam 2022-12-31 2022-12-31 NaN NaN
2189 rows × 5 columns
prdt
Day Sabah Güneş Öğle İkindi Akşam Yatsı
0 2022-01-01 06:51:00 08:29:00 13:08:00 15:29:00 17:47:00 19:20:00
1 2022-01-02 06:51:00 08:29:00 13:09:00 15:30:00 17:48:00 19:21:00
2 2022-01-03 06:51:00 08:29:00 13:09:00 15:30:00 17:48:00 19:22:00
3 2022-01-04 06:51:00 08:29:00 13:09:00 15:31:00 17:49:00 19:22:00
4 2022-01-05 06:51:00 08:29:00 13:10:00 15:32:00 17:50:00 19:23:00
...........................................................................
360 2022-12-27 06:49:00 08:27:00 13:06:00 15:25:00 17:43:00 19:16:00
361 2022-12-28 06:50:00 08:28:00 13:06:00 15:26:00 17:43:00 19:17:00
362 2022-12-29 06:50:00 08:28:00 13:07:00 15:26:00 17:44:00 19:18:00
363 2022-12-30 06:50:00 08:28:00 13:07:00 15:27:00 17:45:00 19:18:00
364 2022-12-31 06:50:00 08:28:00 13:07:00 15:28:00 17:46:00 19:19:00
365 rows × 7 columns
I selected every Sabah row with prcal.iloc[::6,:].
I made a list from prdt['Sabah'].
But when I assign prcal.iloc[::6,:] = prdt['Sabah'][0:365] I get a ValueError:
ValueError: Must have equal len keys and value when setting with an iterable
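The error arises because prcal.iloc[::6, :] selects every column of the Sabah rows, while prdt['Sabah'] is a one-dimensional Series: pandas cannot broadcast a 1-D iterable across a 2-D selection. A sketch of one way around it on tiny stand-in frames (the target column Start_Time is my assumption; substitute whichever column should receive the times):

```python
import pandas as pd

# Tiny stand-ins for prcal/prdt: two days, two Vakit rows per day.
prcal = pd.DataFrame({
    "Vakit": ["Sabah", "Güneş", "Sabah", "Güneş"],
    "Start_Time": [None, None, None, None],
})
prdt = pd.DataFrame({"Sabah": ["06:51:00", "06:51:00"]})

# Restrict the assignment to ONE column; .to_numpy() drops prdt's
# index so the values are placed purely by position.
col = prcal.columns.get_loc("Start_Time")
prcal.iloc[::2, col] = prdt["Sabah"].to_numpy()
# Every 'Sabah' row now carries its time; the other rows stay empty.
```

With the real frames, the step would be iloc[::6, ...] and the full 365-value column, but the shape rule is the same: the row selection and the assigned values must line up one-to-one.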
I have a df sample with one of the columns named date_code, dtype datetime64[ns]:
date_code
2022-03-28
2022-03-29
2022-03-30
2022-03-31
2022-04-01
2022-04-07
2022-04-07
2022-04-08
2022-04-12
2022-04-12
2022-04-14
2022-04-14
2022-04-15
2022-04-16
2022-04-16
2022-04-17
2022-04-18
2022-04-19
2022-04-20
2022-04-20
2022-04-21
2022-04-22
2022-04-25
2022-04-25
2022-04-26
I would like to create a column based on some conditions comparing the current row with the previous one. I'm trying to create a function like:
def start_date(row):
    if (row['date_code'] - row['date_code'].shift(-1)).days > 1:
        val = row['date_code'].shift(-1)
    elif row['date_code'] == row['date_code'].shift(-1):
        val = row['date_code']
    else:
        val = np.nan()
    return val
But once I apply
sample['date_zero_recorded'] = sample.apply(start_date, axis=1)
I get error:
AttributeError: 'Timestamp' object has no attribute 'shift'
How should I compare the current row with the previous one under these conditions?
Edited: expected output
if the current row is 2 or more days after the previous, take the previous value
if the current row equals the previous, take the current value
else, return NaN (including when the current row is exactly 1 day after the previous)
date_code date_zero_recorded
2022-03-28 NaN
2022-03-29 NaN
2022-03-30 NaN
2022-03-31 NaN
2022-04-01 NaN
2022-04-07 2022-04-01
2022-04-07 2022-04-07
2022-04-08 NaN
2022-04-12 2022-04-08
2022-04-12 2022-04-12
2022-04-14 2022-04-12
2022-04-14 2022-04-14
2022-04-15 NaN
2022-04-16 NaN
2022-04-16 2022-04-16
2022-04-17 NaN
2022-04-18 NaN
2022-04-19 NaN
2022-04-20 NaN
2022-04-20 2022-04-20
2022-04-21 NaN
2022-04-22 NaN
2022-04-25 2022-04-22
2022-04-25 2022-04-25
2022-04-26 NaN
You shouldn't apply a function row by row (a single row is a Timestamp here and has no .shift; only a Series does); use vectorized code instead.
For example, take the previous value with shift() and mask it wherever the gap is exactly one day: a diff of 0 means the previous value equals the current one, and a diff of 2+ days is the "take previous" case, so only the 1-day gaps become NaT:
sample['date_code'] = pd.to_datetime(sample['date_code'])
sample['date_zero_recorded'] = (
    sample['date_code'].shift()
    .where(sample['date_code'].diff().ne('1d'))
)
output:
date_code date_zero_recorded
0 2022-03-28 NaT
1 2022-03-29 NaT
2 2022-03-30 NaT
3 2022-03-31 NaT
4 2022-04-01 NaT
5 2022-04-07 2022-04-01
6 2022-04-07 2022-04-07
7 2022-04-08 NaT
8 2022-04-12 2022-04-08
9 2022-04-12 2022-04-12
10 2022-04-14 2022-04-12
11 2022-04-14 2022-04-14
12 2022-04-15 NaT
13 2022-04-16 NaT
14 2022-04-16 2022-04-16
15 2022-04-17 NaT
16 2022-04-18 NaT
17 2022-04-19 NaT
18 2022-04-20 NaT
19 2022-04-20 2022-04-20
20 2022-04-21 NaT
21 2022-04-22 NaT
22 2022-04-25 2022-04-22
23 2022-04-25 2022-04-25
24 2022-04-26 NaT
I'm trying to merge two dataframes by time with multiple matches. I'm looking for all the instances of df2 whose timestamp falls 7 days or less before endofweek in df1. There may be more than one record that fits, and I want all of the matches, not just the first or last (which is all pd.merge_asof returns).
import pandas as pd
df1 = pd.DataFrame({'endofweek': ['2019-08-31', '2019-08-31', '2019-09-07', '2019-09-07', '2019-09-14', '2019-09-14'], 'GroupCol': [1234,8679,1234,8679,1234,8679]})
df2 = pd.DataFrame({'timestamp': ['2019-08-30 10:00', '2019-08-30 10:30', '2019-09-07 12:00', '2019-09-08 14:00'], 'GroupVal': [1234, 1234, 8679, 1234], 'TextVal': ['1234_1', '1234_2', '8679_1', '1234_3']})
df1['endofweek'] = pd.to_datetime(df1['endofweek'])
df2['timestamp'] = pd.to_datetime(df2['timestamp'])
I've tried
pd.merge_asof(df1, df2, tolerance=pd.Timedelta('7d'), direction='backward', left_on='endofweek', right_on='timestamp', left_by='GroupCol', right_by='GroupVal')
but that gets me
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
1 2019-08-31 8679 NaT NaN NaN
2 2019-09-07 1234 NaT NaN NaN
3 2019-09-07 8679 NaT NaN NaN
4 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
5 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
I'm losing the text 1234_1. Is there a way to do a sort of outer join for pd.merge_asof, where I can keep all of the instances of df2 and not just the first or last?
My ideal result would look like this (assuming that the endofweek times are treated like 00:00:00 on that date):
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:00:00 1234.0 1234_1
1 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 8679 NaT NaN NaN
3 2019-09-07 1234 NaT NaN NaN
4 2019-09-07 8679 NaT NaN NaN
5 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
6 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
pd.merge_asof only does a left join. After a lot of frustration trying to speed up the groupby/merge_ordered example, it's more intuitive and faster to do pd.merge_asof on both data sources in different directions, and then do an outer join to combine them.
left_merge = pd.merge_asof(df1, df2,
                           tolerance=pd.Timedelta('7d'), direction='backward',
                           left_on='endofweek', right_on='timestamp',
                           left_by='GroupCol', right_by='GroupVal')
right_merge = pd.merge_asof(df2, df1,
                            tolerance=pd.Timedelta('7d'), direction='forward',
                            left_on='timestamp', right_on='endofweek',
                            left_by='GroupVal', right_by='GroupCol')
merged = (left_merge.merge(right_merge, how="outer")
          .sort_values(['endofweek', 'GroupCol', 'timestamp'])
          .reset_index(drop=True))
merged
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:00:00 1234.0 1234_1
1 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 8679 NaT NaN NaN
3 2019-09-07 1234 NaT NaN NaN
4 2019-09-07 8679 NaT NaN NaN
5 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
6 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
In addition, it is much faster than my other answer:
import time
n = 1000
start = time.time()
for i in range(n):
    left_merge = pd.merge_asof(df1, df2,
                               tolerance=pd.Timedelta('7d'), direction='backward',
                               left_on='endofweek', right_on='timestamp',
                               left_by='GroupCol', right_by='GroupVal')
    right_merge = pd.merge_asof(df2, df1,
                                tolerance=pd.Timedelta('7d'), direction='forward',
                                left_on='timestamp', right_on='endofweek',
                                left_by='GroupVal', right_by='GroupCol')
    merged = (left_merge.merge(right_merge, how="outer")
              .sort_values(['endofweek', 'GroupCol', 'timestamp'])
              .reset_index(drop=True))
end = time.time()
end - start
15.040804386138916
One way I tried is using groupby on one data frame, and then subsetting the other one in a pd.merge_ordered:
merged = (df1.groupby(['GroupCol', 'endofweek'])
          .apply(lambda x: pd.merge_ordered(
              x,
              df2[(df2['GroupVal'] == x.name[0])
                  & (abs(df2['timestamp'] - x.name[1]) <= pd.Timedelta('7d'))],
              left_on='endofweek', right_on='timestamp')))
merged
endofweek GroupCol timestamp GroupVal TextVal
GroupCol endofweek
1234 2019-08-31 0 NaT NaN 2019-08-30 10:00:00 1234.0 1234_1
1 NaT NaN 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 1234.0 NaT NaN NaN
2019-09-07 0 2019-09-07 1234.0 NaT NaN NaN
2019-09-14 0 NaT NaN 2019-09-08 14:00:00 1234.0 1234_3
1 2019-09-14 1234.0 NaT NaN NaN
8679 2019-08-31 0 2019-08-31 8679.0 NaT NaN NaN
2019-09-07 0 2019-09-07 8679.0 NaT NaN NaN
2019-09-14 0 NaT NaN 2019-09-07 12:00:00 8679.0 8679_1
1 2019-09-14 8679.0 NaT NaN NaN
merged[['endofweek', 'GroupCol']] = (merged[['endofweek', 'GroupCol']]
                                     .fillna(method="bfill"))
merged.reset_index(drop=True, inplace=True)
merged
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234.0 2019-08-30 10:00:00 1234.0 1234_1
1 2019-08-31 1234.0 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 1234.0 NaT NaN NaN
3 2019-09-07 1234.0 NaT NaN NaN
4 2019-09-14 1234.0 2019-09-08 14:00:00 1234.0 1234_3
5 2019-09-14 1234.0 NaT NaN NaN
6 2019-08-31 8679.0 NaT NaN NaN
7 2019-09-07 8679.0 NaT NaN NaN
8 2019-09-14 8679.0 2019-09-07 12:00:00 8679.0 8679_1
9 2019-09-14 8679.0 NaT NaN NaN
However it seems to me the result is very slow:
import time
n = 1000
start = time.time()
for i in range(n):
    merged = (df1.groupby(['GroupCol', 'endofweek'])
              .apply(lambda x: pd.merge_ordered(
                  x,
                  df2[(df2['GroupVal'] == x.name[0])
                      & (abs(df2['timestamp'] - x.name[1]) <= pd.Timedelta('7d'))],
                  left_on='endofweek', right_on='timestamp')))
end = time.time()
end - start
40.72932052612305
I would greatly appreciate any improvements!
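One more sketch worth benchmarking (my suggestion, on hypothetical mini-frames rather than the full data): since the requirement is all matches within the window, a plain many-to-many merge on the group key followed by a time filter avoids merge_asof entirely; unmatched weeks are added back afterwards:

```python
import pandas as pd

# Hypothetical mini-frames in the shape of df1/df2 above.
df1 = pd.DataFrame({'endofweek': pd.to_datetime(['2019-08-31', '2019-09-07']),
                    'GroupCol': [1234, 1234]})
df2 = pd.DataFrame({'timestamp': pd.to_datetime(['2019-08-30 10:00',
                                                 '2019-08-30 10:30']),
                    'GroupVal': [1234, 1234],
                    'TextVal': ['1234_1', '1234_2']})

# Many-to-many merge on the group key alone, then keep only rows whose
# timestamp falls in the 7 days before endofweek (all matches survive).
m = df1.merge(df2, left_on='GroupCol', right_on='GroupVal', how='left')
in_window = (m['endofweek'] - m['timestamp']).between(pd.Timedelta(0),
                                                      pd.Timedelta('7d'))
matched = m[in_window]

# Re-attach the weeks that matched nothing, keeping them as NaT/NaN rows.
unmatched = df1.merge(matched[['endofweek', 'GroupCol']].drop_duplicates(),
                      how='left', indicator=True)
unmatched = unmatched[unmatched['_merge'] == 'left_only'].drop(columns='_merge')
result = pd.concat([matched, unmatched], ignore_index=True, sort=False)
# result: both 1234_1 and 1234_2 for 2019-08-31, plus an empty 2019-09-07 row.
```

The caveat is memory: the intermediate merge materializes every key pair before filtering, so this is only viable when each group is reasonably small.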
I have the following data:
(Pdb) df1 = pd.DataFrame({'id': ['SE0000195570','SE0000195570','SE0000195570','SE0000195570','SE0000191827','SE0000191827','SE0000191827','SE0000191827', 'SE0000191827'],'val': ['1','2','3','4','5','6','7','8', '9'],'date': pd.to_datetime(['2014-10-23','2014-07-16','2014-04-29','2014-01-31','2018-10-19','2018-07-11','2018-04-20','2018-02-16','2018-12-29'])})
(Pdb) df1
id val date
0 SE0000195570 1 2014-10-23
1 SE0000195570 2 2014-07-16
2 SE0000195570 3 2014-04-29
3 SE0000195570 4 2014-01-31
4 SE0000191827 5 2018-10-19
5 SE0000191827 6 2018-07-11
6 SE0000191827 7 2018-04-20
7 SE0000191827 8 2018-02-16
8 SE0000191827 9 2018-12-29
UPDATE:
As per the suggestions of @user3483203 I have gotten a bit further, but I'm not quite there. I've amended the example data above with a new row to illustrate better.
(Pdb) df2.assign(calc=(df2.dropna()['val'].groupby(level=0).rolling(4).sum().shift(-3).reset_index(0, drop=True)))
id val date calc
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16 26.0
2018-03-31 NaN NaN NaT NaN
2018-04-30 SE0000191827 7 2018-04-20 27.0
2018-05-31 NaN NaN NaT NaN
2018-06-30 NaN NaN NaT NaN
2018-07-31 SE0000191827 6 2018-07-11 NaN
2018-08-31 NaN NaN NaT NaN
2018-09-30 NaN NaN NaT NaN
2018-10-31 SE0000191827 5 2018-10-19 NaN
2018-11-30 NaN NaN NaT NaN
2018-12-31 SE0000191827 9 2018-12-29 NaN
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31 10.0
2014-02-28 NaN NaN NaT NaN
2014-03-31 NaN NaN NaT NaN
2014-04-30 SE0000195570 3 2014-04-29 NaN
2014-05-31 NaN NaN NaT NaN
2014-06-30 NaN NaN NaT NaN
2014-07-31 SE0000195570 2 2014-07-16 NaN
2014-08-31 NaN NaN NaT NaN
2014-09-30 NaN NaN NaT NaN
2014-10-31 SE0000195570 1 2014-10-23 NaN
For my requirements, the row (SE0000191827, 2018-03-31) should have a calc value since it has four consecutive rows with a value. Currently the row is being removed with the dropna call and I can't figure out how to solve that problem.
What I need
Calculations: The dates in my initial data are quarterly. However, I need to transform this data into monthly rows ranging between the first and last date of each id, and for each month calculate the sum of the four closest consecutive rows of the input data within that id. That's a mouthful. This led me to resample. See expected output below. I need the data to be grouped by both id and the monthly dates.
Performance: The data I'm testing on now is just for benchmarking, but I will need the solution to be performant. I'm expecting to run this on upwards of 100k unique ids, which may result in around 12 million rows (100k ids, dates ranging back up to 10 years, 10 years * 12 months = 120 months per id, 100k * 120 = 12 million rows).
What I've tried
(Pdb) res = df.groupby('id').resample('M',on='date')
(Pdb) res.first()
id val date
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16
2018-03-31 NaN NaN NaT
2018-04-30 SE0000191827 7 2018-04-20
2018-05-31 NaN NaN NaT
2018-06-30 NaN NaN NaT
2018-07-31 SE0000191827 6 2018-07-11
2018-08-31 NaN NaN NaT
2018-09-30 NaN NaN NaT
2018-10-31 SE0000191827 5 2018-10-19
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31
2014-02-28 NaN NaN NaT
2014-03-31 NaN NaN NaT
2014-04-30 SE0000195570 3 2014-04-29
2014-05-31 NaN NaN NaT
2014-06-30 NaN NaN NaT
2014-07-31 SE0000195570 2 2014-07-16
2014-08-31 NaN NaN NaT
2014-09-30 NaN NaN NaT
2014-10-31 SE0000195570 1 2014-10-23
This data looks very nice for my case since it's grouped by id and has the dates lined up by month. Here it seems like I could use something like df['val'].rolling(4), make sure it skips NaN values, and put the result in a new column.
Expected output (new column calc):
id val date calc
id date
SE0000191827 2018-02-28 SE0000191827 8 2018-02-16 26
2018-03-31 NaN NaN NaT
2018-04-30 SE0000191827 7 2018-04-20 NaN
2018-05-31 NaN NaN NaT
2018-06-30 NaN NaN NaT
2018-07-31 SE0000191827 6 2018-07-11 NaN
2018-08-31 NaN NaN NaT
2018-09-30 NaN NaN NaT
2018-10-31 SE0000191827 5 2018-10-19 NaN
SE0000195570 2014-01-31 SE0000195570 4 2014-01-31 10
2014-02-28 NaN NaN NaT
2014-03-31 NaN NaN NaT
2014-04-30 SE0000195570 3 2014-04-29 NaN
2014-05-31 NaN NaN NaT
2014-06-30 NaN NaN NaT
2014-07-31 SE0000195570 2 2014-07-16 NaN
2014-08-31 NaN NaN NaT
2014-09-30 NaN NaN NaT
2014-10-31 SE0000195570 1 2014-10-23 NaN
2014-11-30 NaN NaN NaT
2014-12-31 SE0000195570 1 2014-10-23 NaN
Here the result in calc is 26 for the first date since it sums that value with the next three (8+7+6+5). The rest for that id is NaN since four consecutive values are not available.
The problems
While it may look like the data is grouped by id and date, it seems like it's actually grouped by date. I'm not sure how this works. I need the data to be grouped by id and date.
(Pdb) res['val'].get_group(datetime.date(2018,2,28))
7 6.730000e+08
Name: val, dtype: object
The result of the resample above returns a DatetimeIndexResamplerGroupby which doesn't have rolling...
(Pdb) res['val'].rolling(4)
*** AttributeError: 'DatetimeIndexResamplerGroupby' object has no attribute 'rolling'
What to do? My guess is that my approach is wrong but after scouring the documentation I'm not sure where to start.
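For what it's worth, one way around both problems — sketched on made-up rows, not a drop-in answer — is to compute the window on the raw quarterly observations per id before resampling, so the NaN months never enter the rolling window (this mirrors the shift(-3) attempt above, but without the dropna):

```python
import pandas as pd

# Made-up quarterly observations in the shape of df1.
df = pd.DataFrame({
    "id": ["SE0000191827"] * 5,
    "val": [8.0, 7.0, 6.0, 5.0, 9.0],
    "date": pd.to_datetime(["2018-02-16", "2018-04-20", "2018-07-11",
                            "2018-10-19", "2018-12-29"]),
}).sort_values(["id", "date"]).reset_index(drop=True)

# Sum each observation with the next three within the same id BEFORE
# resampling, so gap months cannot break the 4-observation window.
df["calc"] = (df.groupby("id")["val"]
              .transform(lambda s: s.rolling(4).sum().shift(-3)))
# First row gets 8+7+6+5 = 26, second gets 7+6+5+9 = 27, rest NaN.
```

The monthly grid can then still be rebuilt with groupby(...).resample(...) and calc carried along; the point is only that the window runs over observations, not over calendar months.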