Compare current row timestamp with previous row with a condition in a function - python

I have a DataFrame sample with a column named date_code, dtype datetime64[ns]:
date_code
2022-03-28
2022-03-29
2022-03-30
2022-03-31
2022-04-01
2022-04-07
2022-04-07
2022-04-08
2022-04-12
2022-04-12
2022-04-14
2022-04-14
2022-04-15
2022-04-16
2022-04-16
2022-04-17
2022-04-18
2022-04-19
2022-04-20
2022-04-20
2022-04-21
2022-04-22
2022-04-25
2022-04-25
2022-04-26
I would like to create a column based on some conditions comparing the current row with the previous one. I am trying to create a function like:
def start_date(row):
    if (row['date_code'] - row['date_code'].shift(-1)).days > 1:
        val = row['date_code'].shift(-1)
    elif row['date_code'] == row['date_code'].shift(-1):
        val = row['date_code']
    else:
        val = np.nan()
    return val
But once I apply
sample['date_zero_recorded'] = sample.apply(start_date, axis=1)
I get the error:
AttributeError: 'Timestamp' object has no attribute 'shift'
How should I compare the current row with the previous one under these conditions?
Edit: expected output
if the current row is greater than the previous by 2 or more days, return the previous date
if the current row equals the previous one, return the current date
else, return NaN (including when the current row is exactly 1 day after the previous)
date_code date_zero_recorded
2022-03-28 NaN
2022-03-29 NaN
2022-03-30 NaN
2022-03-31 NaN
2022-04-01 NaN
2022-04-07 2022-04-01
2022-04-07 2022-04-07
2022-04-08 NaN
2022-04-12 2022-04-08
2022-04-12 2022-04-12
2022-04-14 2022-04-12
2022-04-14 2022-04-14
2022-04-15 NaN
2022-04-16 NaN
2022-04-16 2022-04-16
2022-04-17 NaN
2022-04-18 NaN
2022-04-19 NaN
2022-04-20 NaN
2022-04-20 2022-04-20
2022-04-21 NaN
2022-04-22 NaN
2022-04-25 2022-04-22
2022-04-25 2022-04-25
2022-04-26 NaN

You shouldn't use a row-wise apply (or iterrows); use vectorized code instead. diff() gives the gap to the previous row, and where() keeps the shifted (previous) date only where that gap is not exactly one day, so duplicate dates yield the current date and one-day steps yield NaT.
For example:
sample['date_code'] = pd.to_datetime(sample['date_code'])
sample['date_zero_recorded'] = (
    sample['date_code'].shift()
    .where(sample['date_code'].diff().ne('1d'))
)
output:
date_code date_zero_recorded
0 2022-03-28 NaT
1 2022-03-29 NaT
2 2022-03-30 NaT
3 2022-03-31 NaT
4 2022-04-01 NaT
5 2022-04-07 2022-04-01
6 2022-04-07 2022-04-07
7 2022-04-08 NaT
8 2022-04-12 2022-04-08
9 2022-04-12 2022-04-12
10 2022-04-14 2022-04-12
11 2022-04-14 2022-04-14
12 2022-04-15 NaT
13 2022-04-16 NaT
14 2022-04-16 2022-04-16
15 2022-04-17 NaT
16 2022-04-18 NaT
17 2022-04-19 NaT
18 2022-04-20 NaT
19 2022-04-20 2022-04-20
20 2022-04-21 NaT
21 2022-04-22 NaT
22 2022-04-25 2022-04-22
23 2022-04-25 2022-04-25
24 2022-04-26 NaT
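If you prefer to spell out the question's three cases explicitly, np.select gives the same result on this data (a minimal sketch, not the answer's original code):

import numpy as np
import pandas as pd

prev = sample['date_code'].shift()
diff = sample['date_code'].diff()

sample['date_zero_recorded'] = np.select(
    [diff >= pd.Timedelta('2d'),   # jump of 2+ days -> previous date
     diff == pd.Timedelta('0d')],  # duplicate date -> current date
    [prev, sample['date_code']],
    default=np.datetime64('NaT'),  # 1-day step (and the first row) -> NaT
)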

Related

dataframe data transfer with selected values to another dataframe

My goal is to take the column Sabah from dataframe prdt and write each of its values into the repeated rows labeled Sabah in dataframe prcal.
prcal
Vakit Start_Date End_Date Start_Time End_Time
0 Sabah 2022-01-01 2022-01-01 NaN NaN
1 Güneş 2022-01-01 2022-01-01 NaN NaN
2 Öğle 2022-01-01 2022-01-01 NaN NaN
3 İkindi 2022-01-01 2022-01-01 NaN NaN
4 Akşam 2022-01-01 2022-01-01 NaN NaN
..........................................................
2184 Sabah 2022-12-31 2022-12-31 NaN NaN
2185 Güneş 2022-12-31 2022-12-31 NaN NaN
2186 Öğle 2022-12-31 2022-12-31 NaN NaN
2187 İkindi 2022-12-31 2022-12-31 NaN NaN
2188 Akşam 2022-12-31 2022-12-31 NaN NaN
2189 rows × 5 columns
prdt
Day Sabah Güneş Öğle İkindi Akşam Yatsı
0 2022-01-01 06:51:00 08:29:00 13:08:00 15:29:00 17:47:00 19:20:00
1 2022-01-02 06:51:00 08:29:00 13:09:00 15:30:00 17:48:00 19:21:00
2 2022-01-03 06:51:00 08:29:00 13:09:00 15:30:00 17:48:00 19:22:00
3 2022-01-04 06:51:00 08:29:00 13:09:00 15:31:00 17:49:00 19:22:00
4 2022-01-05 06:51:00 08:29:00 13:10:00 15:32:00 17:50:00 19:23:00
...........................................................................
360 2022-12-27 06:49:00 08:27:00 13:06:00 15:25:00 17:43:00 19:16:00
361 2022-12-28 06:50:00 08:28:00 13:06:00 15:26:00 17:43:00 19:17:00
362 2022-12-29 06:50:00 08:28:00 13:07:00 15:26:00 17:44:00 19:18:00
363 2022-12-30 06:50:00 08:28:00 13:07:00 15:27:00 17:45:00 19:18:00
364 2022-12-31 06:50:00 08:28:00 13:07:00 15:28:00 17:46:00 19:19:00
365 rows × 7 columns
I selected every Sabah row with prcal.iloc[::6,:] and made a list from prdt['Sabah'].
When assigning prcal.iloc[::6,:] = prdt['Sabah'][0:365] I get a value error:
ValueError: Must have equal len keys and value when setting with an iterable

python: rename doesn't change columns

After outputting a calculation to the data frame, the output has no column name, and the rename function is not working.
time
2022-03-30 22:45:00 NaN
2022-03-30 22:46:00 NaN
2022-03-30 22:47:00 NaN
2022-03-30 22:48:00 NaN
2022-03-30 22:49:00 NaN
...
2022-03-31 15:20:00 43.937125
2022-03-31 15:21:00 42.781336
2022-03-31 15:22:00 43.228084
2022-03-31 15:23:00 48.822237
2022-03-31 15:24:00 58.590912
Name: close, Length: 1000, dtype: float64
<class 'pandas.core.series.Series'>
What should I do if I want to put a name on the column next to time?
The expected result is:
time close
0 2022-03-30 22:45:00 NaN
1 2022-03-30 22:46:00 NaN
2 2022-03-30 22:47:00 NaN
3 2022-03-30 22:48:00 NaN
4 2022-03-30 22:49:00 NaN
5 2022-03-31 15:20:00 43.937125
6 2022-03-31 15:21:00 42.781336
7 2022-03-31 15:22:00 43.228084
8 2022-03-31 15:23:00 48.822237
9 2022-03-31 15:24:00 58.590912
You just have to use df.reset_index() because the dataset above is not a DataFrame but a Series. If you want to keep time as index, you can use df.to_frame() or df.to_frame('new_name').
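For example (a minimal sketch, assuming the Series shown above is bound to the name s):

# s: the Series above -- name 'close', DatetimeIndex named 'time'
df = s.reset_index()           # DataFrame with columns ['time', 'close']

# or, keeping 'time' as the index:
df = s.to_frame()              # single column still named 'close'
df = s.to_frame('new_name')    # rename the column while converting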

pandas_ta parabolic SAR giving wrong values for yfinance

I made a function that uses the psar function from the pandas_ta library. It seems to work incorrectly: it gives the PSARl, PSARs and PSARr values on the wrong dates. Using an interval of 1 day on BTC-USD I get the following output:
Used function:
psar = df.ta.psar(high=df['High'], low=df['Low'], close=df['Close'], af0=0.02, af=0.02, max_af=0.2)
print(psar)
Output:
PSARl_0.02_0.2 PSARs_0.02_0.2 PSARaf_0.02_0.2 PSARr_0.02_0.2
Date
2021-10-29 NaN NaN 0.02 0
2021-10-30 NaN 62927.609375 0.02 1
2021-10-31 NaN 62927.609375 0.04 0
2021-11-01 NaN 62813.478125 0.06 0
2021-11-02 59695.183594 NaN 0.02 1
2021-11-03 59695.183594 NaN 0.02 0
2021-11-04 59786.135781 NaN 0.02 0
2021-11-05 59875.268925 NaN 0.02 0
2021-11-06 59962.619406 NaN 0.02 0
2021-11-07 60048.222877 NaN 0.02 0
2021-11-08 60132.114279 NaN 0.04 0
2021-11-09 60433.779395 NaN 0.06 0
2021-11-10 60919.572788 NaN 0.08 0
2021-11-11 61549.176965 NaN 0.08 0
2021-11-12 62128.412808 NaN 0.08 0
2021-11-13 62333.914062 NaN 0.08 0
2021-11-14 62333.914062 NaN 0.08 0
2021-11-15 62850.370938 NaN 0.08 0
2021-11-16 NaN 68789.625000 0.02 1
2021-11-17 NaN 68594.159219 0.04 0
2021-11-18 NaN 68191.009256 0.06 0
2021-11-19 NaN 67492.596279 0.08 0
2021-11-20 NaN 66549.602952 0.08 0
2021-11-21 NaN 65682.049091 0.08 0
2021-11-22 NaN 64883.899538 0.10 0
2021-11-23 NaN 63963.493569 0.12 0
2021-11-24 NaN 62963.805747 0.12 0
2021-11-25 NaN 62084.080463 0.12 0
2021-11-26 NaN 61309.922214 0.14 0
2021-11-27 NaN 60226.300292 0.14 0
2021-11-28 NaN 59294.385438 0.14 0
2021-11-29 NaN 58492.938664 0.14 0
Looking at the yfinance chart for BTC-USD, I should not get a 1 in the PSARr column at 2021-10-30, but somehow I am. It seems random: some values are correct and some aren't. What am I doing wrong, or is there something wrong within the function?
Thanks!
Other data:
Open High Low Close \
Date
2021-10-29 60624.871094 62927.609375 60329.964844 62227.964844
2021-10-30 62239.363281 62330.144531 60918.386719 61888.832031
2021-10-31 61850.488281 62406.171875 60074.328125 61318.957031
2021-11-01 61320.449219 62419.003906 59695.183594 61004.406250
2021-11-02 60963.253906 64242.792969 60673.054688 63226.402344
2021-11-03 63254.335938 63516.937500 61184.238281 62970.046875
2021-11-04 62941.804688 63123.289062 60799.664062 61452.230469
2021-11-05 61460.078125 62541.468750 60844.609375 61125.675781
2021-11-06 61068.875000 61590.683594 60163.781250 61527.480469
2021-11-07 61554.921875 63326.988281 61432.488281 63326.988281
2021-11-08 63344.066406 67673.742188 63344.066406 67566.828125
2021-11-09 67549.734375 68530.335938 66382.062500 66971.828125
2021-11-10 66953.335938 68789.625000 63208.113281 64995.230469
2021-11-11 64978.890625 65579.015625 64180.488281 64949.960938
2021-11-12 64863.980469 65460.816406 62333.914062 64155.941406
2021-11-13 64158.121094 64915.675781 63303.734375 64469.527344
2021-11-14 64455.371094 65495.179688 63647.808594 65466.839844
2021-11-15 65521.289062 66281.570312 63548.144531 63557.871094
2021-11-16 63721.195312 63721.195312 59016.335938 60161.246094
2021-11-17 60139.621094 60823.609375 58515.410156 60368.011719
2021-11-18 60360.136719 60948.500000 56550.792969 56942.136719
2021-11-19 56896.128906 58351.113281 55705.179688 58119.578125
2021-11-20 58115.082031 59859.878906 57469.726562 59697.195312
2021-11-21 59730.507812 60004.425781 58618.929688 58730.476562
2021-11-22 58706.847656 59266.359375 55679.839844 56289.289062
2021-11-23 56304.554688 57875.515625 55632.761719 57569.074219
2021-11-24 57565.851562 57803.066406 55964.222656 56280.425781
2021-11-25 57165.417969 59367.968750 57146.683594 57274.679688
2021-11-26 58960.285156 59183.480469 53569.765625 53569.765625
2021-11-27 53736.429688 55329.257812 53668.355469 54815.078125
2021-11-28 54813.023438 57393.843750 53576.734375 57248.457031
2021-11-29 57474.843750 58749.250000 56856.371094 58749.250000
Adj Close Volume
Date
2021-10-29 62227.964844 36856881767
2021-10-30 61888.832031 32157938616
2021-10-31 61318.957031 32241199927
2021-11-01 61004.406250 36150572843
2021-11-02 63226.402344 37746665647
2021-11-03 62970.046875 36124731509
2021-11-04 61452.230469 32615846901
2021-11-05 61125.675781 30605102446
2021-11-06 61527.480469 29094934221
2021-11-07 63326.988281 24726754302
2021-11-08 67566.828125 41125608330
2021-11-09 66971.828125 42357991721
2021-11-10 64995.230469 48730828378
2021-11-11 64949.960938 35880633236
2021-11-12 64155.941406 36084893887
2021-11-13 64469.527344 30474228777
2021-11-14 65466.839844 25122092191
2021-11-15 63557.871094 30558763548
2021-11-16 60161.246094 46844335592
2021-11-17 60368.011719 39178392930
2021-11-18 56942.136719 41388338699
2021-11-19 58119.578125 38702407772
2021-11-20 59697.195312 30624264863
2021-11-21 58730.476562 26123447605
2021-11-22 56289.289062 35036121783
2021-11-23 57569.074219 37485803899
2021-11-24 56280.425781 36635566789
2021-11-25 57274.679688 34284016248
2021-11-26 53569.765625 41810748221
2021-11-27 54815.078125 30560857714
2021-11-28 57248.457031 28116886357
2021-11-29 58749.250000 33326104576
psar = df.ta.psar(high=df['High'], low=df['Low'], close=df['Close'], af0=0.02, af=0.02, max_af=0.2)
print(psar)
# the PSARr (reversal) column you need; note it is keyed by af0, i.e. "PSARr_0.02_0.2"
mypsar = psar["PSARr_0.02_0.2"].iloc[-1]
if mypsar > 0:
    mysartrend = "Bull"
else:
    mysartrend = "Bear"
print(mysartrend)

pd.merge_asof with multiple matches per time period?

I'm trying to merge two dataframes by time with multiple matches. I'm looking for all the instances of df2 whose timestamp falls 7 days or less before endofweek in df1. There may be more than one record that fits the case, and I want all of the matches, not just the first or last (which is all pd.merge_asof returns).
import pandas as pd
df1 = pd.DataFrame({'endofweek': ['2019-08-31', '2019-08-31', '2019-09-07', '2019-09-07', '2019-09-14', '2019-09-14'], 'GroupCol': [1234,8679,1234,8679,1234,8679]})
df2 = pd.DataFrame({'timestamp': ['2019-08-30 10:00', '2019-08-30 10:30', '2019-09-07 12:00', '2019-09-08 14:00'], 'GroupVal': [1234, 1234, 8679, 1234], 'TextVal': ['1234_1', '1234_2', '8679_1', '1234_3']})
df1['endofweek'] = pd.to_datetime(df1['endofweek'])
df2['timestamp'] = pd.to_datetime(df2['timestamp'])
I've tried
pd.merge_asof(df1, df2, tolerance=pd.Timedelta('7d'), direction='backward', left_on='endofweek', right_on='timestamp', left_by='GroupCol', right_by='GroupVal')
but that gets me
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
1 2019-08-31 8679 NaT NaN NaN
2 2019-09-07 1234 NaT NaN NaN
3 2019-09-07 8679 NaT NaN NaN
4 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
5 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
I'm losing the text 1234_1. Is there way to do a sort of outer join for pd.merge_asof, where I can keep all of the instances of df2 and not just the first or last?
My ideal result would look like this (assuming that the endofweek times are treated like 00:00:00 on that date):
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:00:00 1234.0 1234_1
1 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 8679 NaT NaN NaN
3 2019-09-07 1234 NaT NaN NaN
4 2019-09-07 8679 NaT NaN NaN
5 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
6 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
pd.merge_asof only does a left join. After a lot of frustration trying to speed up the groupby/merge_ordered approach (shown second below), it's more intuitive and faster to run pd.merge_asof on both data sources in different directions, and then do an outer join to combine them.
left_merge = pd.merge_asof(df1, df2,
                           tolerance=pd.Timedelta('7d'), direction='backward',
                           left_on='endofweek', right_on='timestamp',
                           left_by='GroupCol', right_by='GroupVal')
right_merge = pd.merge_asof(df2, df1,
                            tolerance=pd.Timedelta('7d'), direction='forward',
                            left_on='timestamp', right_on='endofweek',
                            left_by='GroupVal', right_by='GroupCol')
merged = (left_merge.merge(right_merge, how="outer")
          .sort_values(['endofweek', 'GroupCol', 'timestamp'])
          .reset_index(drop=True))
merged
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234 2019-08-30 10:00:00 1234.0 1234_1
1 2019-08-31 1234 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 8679 NaT NaN NaN
3 2019-09-07 1234 NaT NaN NaN
4 2019-09-07 8679 NaT NaN NaN
5 2019-09-14 1234 2019-09-08 14:00:00 1234.0 1234_3
6 2019-09-14 8679 2019-09-07 12:00:00 8679.0 8679_1
In addition, it is much faster than the groupby/merge_ordered approach below:
import time

n = 1000
start = time.time()
for i in range(n):
    left_merge = pd.merge_asof(df1, df2,
                               tolerance=pd.Timedelta('7d'), direction='backward',
                               left_on='endofweek', right_on='timestamp',
                               left_by='GroupCol', right_by='GroupVal')
    right_merge = pd.merge_asof(df2, df1,
                                tolerance=pd.Timedelta('7d'), direction='forward',
                                left_on='timestamp', right_on='endofweek',
                                left_by='GroupVal', right_by='GroupCol')
    merged = (left_merge.merge(right_merge, how="outer")
              .sort_values(['endofweek', 'GroupCol', 'timestamp'])
              .reset_index(drop=True))
end = time.time()
end - start
15.040804386138916
One way I tried is using groupby on one data frame, and then subsetting the other one inside a pd.merge_ordered:
merged = (df1.groupby(['GroupCol', 'endofweek'])
          .apply(lambda x: pd.merge_ordered(
              x,
              df2[(df2['GroupVal'] == x.name[0])
                  & (abs(df2['timestamp'] - x.name[1]) <= pd.Timedelta('7d'))],
              left_on='endofweek', right_on='timestamp')))
merged
endofweek GroupCol timestamp GroupVal TextVal
GroupCol endofweek
1234 2019-08-31 0 NaT NaN 2019-08-30 10:00:00 1234.0 1234_1
1 NaT NaN 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 1234.0 NaT NaN NaN
2019-09-07 0 2019-09-07 1234.0 NaT NaN NaN
2019-09-14 0 NaT NaN 2019-09-08 14:00:00 1234.0 1234_3
1 2019-09-14 1234.0 NaT NaN NaN
8679 2019-08-31 0 2019-08-31 8679.0 NaT NaN NaN
2019-09-07 0 2019-09-07 8679.0 NaT NaN NaN
2019-09-14 0 NaT NaN 2019-09-07 12:00:00 8679.0 8679_1
1 2019-09-14 8679.0 NaT NaN NaN
merged[['endofweek', 'GroupCol']] = (merged[['endofweek', 'GroupCol']]
                                     .fillna(method="bfill"))
merged.reset_index(drop=True, inplace=True)
merged
endofweek GroupCol timestamp GroupVal TextVal
0 2019-08-31 1234.0 2019-08-30 10:00:00 1234.0 1234_1
1 2019-08-31 1234.0 2019-08-30 10:30:00 1234.0 1234_2
2 2019-08-31 1234.0 NaT NaN NaN
3 2019-09-07 1234.0 NaT NaN NaN
4 2019-09-14 1234.0 2019-09-08 14:00:00 1234.0 1234_3
5 2019-09-14 1234.0 NaT NaN NaN
6 2019-08-31 8679.0 NaT NaN NaN
7 2019-09-07 8679.0 NaT NaN NaN
8 2019-09-14 8679.0 2019-09-07 12:00:00 8679.0 8679_1
9 2019-09-14 8679.0 NaT NaN NaN
However, this approach seems very slow:
import time

n = 1000
start = time.time()
for i in range(n):
    merged = (df1.groupby(['GroupCol', 'endofweek'])
              .apply(lambda x: pd.merge_ordered(
                  x,
                  df2[(df2['GroupVal'] == x.name[0])
                      & (abs(df2['timestamp'] - x.name[1]) <= pd.Timedelta('7d'))],
                  left_on='endofweek', right_on='timestamp')))
end = time.time()
end - start
40.72932052612305
I would greatly appreciate any improvements!
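One more vectorized possibility (a sketch against the same df1/df2 as above, untimed here): do a plain many-to-many merge on the group key, filter to the 7-day window, then re-join to df1 so keys with no in-window match survive as NaT rows:

import pandas as pd

# Many-to-many merge on the group key; this can get large when groups
# are big, so it trades memory for simplicity.
m = df1.merge(df2, left_on='GroupCol', right_on='GroupVal', how='left')

# Keep only candidates whose timestamp falls 0-7 days before endofweek.
in_window = (m['endofweek'] - m['timestamp']).between(pd.Timedelta('0d'),
                                                      pd.Timedelta('7d'))

# Re-join so unmatched keys come back as NaT rows; this relies on the
# (endofweek, GroupCol) pairs being unique in df1.
out = (df1.merge(m[in_window], on=['endofweek', 'GroupCol'], how='left')
       .sort_values(['endofweek', 'GroupCol', 'timestamp'])
       .reset_index(drop=True))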

How do I use conditional logic with Datetime columns in Pandas?

I have two datetime columns - ColumnA and ColumnB. I want to create a new column - ColumnC, using conditional logic.
Originally, I created ColumnB from a YearMonth column of dates such as 201907, 201908, etc.
When ColumnA is NaN, I want to choose ColumnB.
Otherwise, I want to choose ColumnA.
Currently, my code below causes ColumnC to have mixed formats, and I'm not sure how to get rid of all of those long integers. I want the whole column to be YYYY-MM-DD.
ID YearMonth ColumnA ColumnB ColumnC
0 1 201712 2017-12-29 2017-12-31 2017-12-29
1 1 201801 2018-01-31 2018-01-31 2018-01-31
2 1 201802 2018-02-28 2018-02-28 2018-02-28
3 1 201806 2018-06-29 2018-06-30 2018-06-29
4 1 201807 2018-07-31 2018-07-31 2018-07-31
5 1 201808 2018-08-31 2018-08-31 2018-08-31
6 1 201809 2018-09-28 2018-09-30 2018-09-28
7 1 201810 2018-10-31 2018-10-31 2018-10-31
8 1 201811 2018-11-30 2018-11-30 2018-11-30
9 1 201812 2018-12-31 2018-12-31 2018-12-31
10 1 201803 NaN 2018-03-31 1522454400000000000
11 1 201804 NaN 2018-04-30 1525046400000000000
12 1 201805 NaN 2018-05-31 1527724800000000000
13 1 201901 NaN 2019-01-31 1548892800000000000
14 1 201902 NaN 2019-02-28 1551312000000000000
15 1 201903 NaN 2019-03-31 1553990400000000000
16 1 201904 NaN 2019-04-30 1556582400000000000
17 1 201905 NaN 2019-05-31 1559260800000000000
18 1 201906 NaN 2019-06-30 1561852800000000000
19 1 201907 NaN 2019-07-31 1564531200000000000
20 1 201908 NaN 2019-08-31 1567209600000000000
21 1 201909 NaN 2019-09-30 1569801600000000000
df['ColumnB'] = pd.to_datetime(df['YearMonth'], format='%Y%m', errors='coerce').dropna() + pd.offsets.MonthEnd(0)
df['ColumnC'] = np.where(pd.isna(df['ColumnA']), pd.to_datetime(df['ColumnB'], format='%Y%m%d'), df['ColumnA'])
df['ColumnC'] = np.where(df['ColumnA'].isnull(),df['ColumnB'] , df['ColumnA'])
Just figured it out!
df['ColumnC'] = np.where(pd.isna(df['ColumnA']), pd.to_datetime(df['ColumnB']), pd.to_datetime(df['ColumnA']))
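As a footnote, the same result can be had without np.where at all: Series.fillna takes values from ColumnB exactly where ColumnA is missing (a sketch, assuming both columns parse as datetimes):

df['ColumnC'] = pd.to_datetime(df['ColumnA']).fillna(pd.to_datetime(df['ColumnB']))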
