Pandas/Datetime rounding error (Rounding to 10 seconds) - python

I have a problem using the round() function on a pandas Series / DataFrame filled with Timestamps.
Code example to round to the nearest 10-second mark:
import pandas as pd
date = pd.DataFrame([pd.to_datetime(t) for t in [
    '2022-03-02 06:46:05', '2022-03-02 06:46:15', '2022-03-02 06:46:25',
    '2022-03-02 06:46:35', '2022-03-02 06:46:45', '2022-03-02 06:46:55',
    '2022-03-02 06:47:05', '2022-03-02 06:47:15', '2022-03-02 06:47:25']])
date[1] = date[0].round('10s')
date
OUT:
0 1
0 2022-03-02 06:46:05 2022-03-02 06:46:00
1 2022-03-02 06:46:15 2022-03-02 06:46:20
2 2022-03-02 06:46:25 2022-03-02 06:46:20
3 2022-03-02 06:46:35 2022-03-02 06:46:40
4 2022-03-02 06:46:45 2022-03-02 06:46:40
5 2022-03-02 06:46:55 2022-03-02 06:47:00
6 2022-03-02 06:47:05 2022-03-02 06:47:00
7 2022-03-02 06:47:15 2022-03-02 06:47:20
8 2022-03-02 06:47:25 2022-03-02 06:47:20
dtype: datetime64[ns]
Whenever a Timestamp's seconds value is in [5, 25, 45], the rounded value ends up at [0, 20, 40], although I expect [10, 30, 50]. Any idea how to fix this?
Thanks in advance!

This happens because pandas rounds half-way values to the nearest even multiple (round-half-to-even, also known as banker's rounding) rather than always up. A workaround is to add a tiny timedelta before rounding:
date[1] = date[0].add(pd.Timedelta('1us')).round('10s')
print (date)
0 1
0 2022-03-02 06:46:05 2022-03-02 06:46:10
1 2022-03-02 06:46:15 2022-03-02 06:46:20
2 2022-03-02 06:46:25 2022-03-02 06:46:30
3 2022-03-02 06:46:35 2022-03-02 06:46:40
4 2022-03-02 06:46:45 2022-03-02 06:46:50
5 2022-03-02 06:46:55 2022-03-02 06:47:00
6 2022-03-02 06:47:05 2022-03-02 06:47:10
7 2022-03-02 06:47:15 2022-03-02 06:47:20
8 2022-03-02 06:47:25 2022-03-02 06:47:30

Try to use .dt.ceil():
date[1] = date[0].dt.ceil('10s')
print(date)
Prints:
0 1
0 2022-03-02 06:46:04 2022-03-02 06:46:10
1 2022-03-02 06:46:05 2022-03-02 06:46:10
2 2022-03-02 06:46:06 2022-03-02 06:46:10
3 2022-03-02 06:46:24 2022-03-02 06:46:30
4 2022-03-02 06:46:25 2022-03-02 06:46:30
5 2022-03-02 06:46:26 2022-03-02 06:46:30
6 2022-03-02 06:46:44 2022-03-02 06:46:50
7 2022-03-02 06:46:45 2022-03-02 06:46:50
8 2022-03-02 06:46:46 2022-03-02 06:46:50


how to create a pandas data frame with the first days and the last days of Months

I would like to create such a data frame:
0 2022-01-01
1 2022-01-31
2 2022-02-01
3 2022-02-28
4 2022-03-01
5 2022-03-31
I tried this, but could not figure it out:
df = pd.date_range(start='1/1/2022', end='6/30/2022', freq='M')
You can use a list comprehension with pd.offsets:
date_range = pd.date_range(start="1/1/2022", end="6/30/2022", freq="M")
month_spans = [[x + pd.offsets.MonthBegin(-1), x + pd.offsets.MonthEnd(0)] for x in date_range]
dates = [x for sublist in month_spans for x in sublist]
df = pd.Series(dates).to_frame("date")
print(df)
Output:
date
0 2022-01-01
1 2022-01-31
2 2022-02-01
3 2022-02-28
4 2022-03-01
5 2022-03-31
6 2022-04-01
7 2022-04-30
8 2022-05-01
9 2022-05-31
10 2022-06-01
11 2022-06-30
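To see what the two offsets do to the month-end dates produced by freq='M', here is a minimal sketch:

```python
import pandas as pd

ts = pd.Timestamp('2022-01-31')        # freq='M' yields month-end dates like this
print(ts + pd.offsets.MonthBegin(-1))  # rolls back to the first of the month
print(ts + pd.offsets.MonthEnd(0))     # already a month end, so it stays put
```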
You can use:
import pandas as pd
start = '2022-01-01'
end = '2022-06-30'
first_day = pd.date_range(start, end, freq='MS').astype(str)  # first day of each month in range
last_day = pd.date_range(start, end, freq='M').astype(str)    # last day of each month in range
df = pd.DataFrame({'first_day': first_day, 'last_day': last_day})
df = df.stack().reset_index(drop=True)
Output:
0
0 2022-01-01
1 2022-01-31
2 2022-02-01
3 2022-02-28
4 2022-03-01
5 2022-03-31
6 2022-04-01
7 2022-04-30
8 2022-05-01
9 2022-05-31
10 2022-06-01
11 2022-06-30
Might not be the cleanest answer out there, but it does work...
import calendar
import pandas as pd

Year2022 = []
for i in range(1, 13):
    weekday, numdays = calendar.monthrange(2022, i)
    Year2022.append(str(i) + "/1/2022")
    Year2022.append(str(i) + "/" + str(numdays) + "/2022")
df = pd.DataFrame({'DateTime': Year2022})
df['DateTime'] = pd.to_datetime(df['DateTime'])
Returns
df
DateTime
0 2022-01-01
1 2022-01-31
2 2022-02-01
3 2022-02-28
4 2022-03-01
5 2022-03-31
6 2022-04-01
7 2022-04-30
8 2022-05-01
9 2022-05-31
10 2022-06-01
11 2022-06-30
12 2022-07-01
13 2022-07-31
14 2022-08-01
15 2022-08-31
16 2022-09-01
17 2022-09-30
18 2022-10-01
19 2022-10-31
20 2022-11-01
21 2022-11-30
22 2022-12-01
23 2022-12-31

Group consecutive rises and falls using Pandas Series

I want to group consecutive rises and falls in a pandas Series. I have tried this, but it does not seem to work:
consec_rises = self.df_dataset.diff().cumsum()
group_consec = consec_rises.groupby(consec_rises)
My dataset:
date
2022-01-07 25.847718
2022-01-08 29.310294
2022-01-09 31.791339
2022-01-10 33.382136
2022-01-11 31.791339
2022-01-12 29.310294
2022-01-13 25.847718
2022-01-14 21.523483
2022-01-15 16.691068
2022-01-16 11.858653
2022-01-17 7.534418
I want to get result as following:
Group #1 (consecutive growth)
2022-01-07 25.847718
2022-01-08 29.310294
2022-01-09 31.791339
2022-01-10 33.382136
Group #2 (consecutive fall)
2022-01-12 29.310294
2022-01-13 25.847718
2022-01-14 21.523483
2022-01-15 16.691068
2022-01-16 11.858653
2022-01-17 7.534418
If I understand you correctly:
mask = df["date"].diff().bfill() >= 0
for _, g in df.groupby((mask != mask.shift(1)).cumsum()):
    print(g)
    print("-" * 80)
Prints:
date
2022-01-07 25.847718
2022-01-08 29.310294
2022-01-09 31.791339
2022-01-10 33.382136
--------------------------------------------------------------------------------
date
2022-01-11 31.791339
2022-01-12 29.310294
2022-01-13 25.847718
2022-01-14 21.523483
2022-01-15 16.691068
2022-01-16 11.858653
2022-01-17 7.534418
--------------------------------------------------------------------------------
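The grouping key in the answer numbers consecutive runs of the rising/falling mask; a small sketch with a toy series (the values here are illustrative, not the question's data):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 2.0, 1.0, 2.0])
mask = s.diff().bfill() >= 0                 # True while rising, False while falling
group_id = (mask != mask.shift(1)).cumsum()  # bumps the id each time the direction flips
print(group_id.tolist())  # [1, 1, 1, 2, 2, 3]
```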

Find the closest date from a list to a given date that is not after the given date

I have a dataframe for weekly training sessions and a data frame for evaluations submitted by attendees at those training sessions.
Each dataframe has a date column - for sessions, it is the date the session occurred. For evaluations, it is the date the evaluation was submitted. Attendees can be expected to attend multiple sessions and will therefore have submitted multiple evaluations.
I need to tie each evaluation back to a specific session. They may have submitted an evaluation on the same day as a session, in which case the match is easy. But they are able to submit an evaluation on any day up to the next training session.
For each date in the evaluation df, I need to return the session date that is closest to the evaluation date but not after the evaluation date.
example session dates:
2/3/22, 2/10/22, 2/17/22
example evaluation dates with desired output:
2/3/22 (should match 2/3/22), 2/4/22 (should match 2/3/22), 2/11/22 (should match 2/10/22)
Here's a way to do it.
In the sessions dataframe, set date column to be the index:
sessions = sessions.set_index('date')
Sort sessions by index (that is, by date):
sessions = sessions.loc[sessions.index.sort_values()]
Add a session_evaluated column to evaluations that holds the date of the session each evaluation applies to. We compute it by calling sessions.index.get_indexer() on the date column of evaluations with method='pad', so non-matching dates are "rounded down" to the previous session date, and then looking up those integer positions in the sessions index (which contains the session dates):
evaluations['session_evaluated'] = pd.Series(
    [sessions.index.to_list()[i]
     for i in sessions.index.get_indexer(evaluations['date'], method='pad')])
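Using the example dates from the question (sessions 2/3, 2/10, 2/17), method='pad' picks the position of the session at or before each evaluation date; a minimal sketch:

```python
import pandas as pd

sessions_idx = pd.DatetimeIndex(['2022-02-03', '2022-02-10', '2022-02-17'])
evals = pd.to_datetime(['2022-02-03', '2022-02-04', '2022-02-11'])
pos = sessions_idx.get_indexer(evals, method='pad')  # -> positions [0, 0, 1]
print([str(sessions_idx[i].date()) for i in pos])    # matches 2/3, 2/3, 2/10
```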
Here's what it looks like all put together with sample inputs:
import pandas as pd
sessions = pd.DataFrame({
    'date': ['2022-02-01', '2022-03-01', '2022-04-01', '2022-05-01', '2022-01-01'],
    'topic': ['Easy 1', 'Easy 2', 'Intermediate', 'Advanced', 'Intro']
})
evaluations = pd.DataFrame({
    'date': [
        '2022-01-05', '2022-01-10', '2022-01-15', '2022-01-20', '2022-01-25',
        '2022-02-01', '2022-02-05', '2022-02-28',
        '2022-03-01', '2022-03-15', '2022-03-31',
        '2022-04-01', '2022-04-15'
    ],
    'rating': [9, 8, 7, 8, 9, 5, 4, 3, 10, 10, 10, 2, 4]
})
sessions['date'] = pd.to_datetime(sessions['date'])
evaluations['date'] = pd.to_datetime(evaluations['date'])
sessions = sessions.set_index('date')
sessions = sessions.loc[sessions.index.sort_values()]
print(sessions)
print(evaluations)
evaluations['session_evaluated'] = pd.Series(
    [sessions.index.to_list()[i]
     for i in sessions.index.get_indexer(evaluations['date'], method='pad')])
print(evaluations)
Results:
topic
date
2022-01-01 Intro
2022-02-01 Easy 1
2022-03-01 Easy 2
2022-04-01 Intermediate
2022-05-01 Advanced
date rating
0 2022-01-05 9
1 2022-01-10 8
2 2022-01-15 7
3 2022-01-20 8
4 2022-01-25 9
5 2022-02-01 5
6 2022-02-05 4
7 2022-02-28 3
8 2022-03-01 10
9 2022-03-15 10
10 2022-03-31 10
11 2022-04-01 2
12 2022-04-15 4
date rating session_evaluated
0 2022-01-05 9 2022-01-01
1 2022-01-10 8 2022-01-01
2 2022-01-15 7 2022-01-01
3 2022-01-20 8 2022-01-01
4 2022-01-25 9 2022-01-01
5 2022-02-01 5 2022-02-01
6 2022-02-05 4 2022-02-01
7 2022-02-28 3 2022-02-01
8 2022-03-01 10 2022-03-01
9 2022-03-15 10 2022-03-01
10 2022-03-31 10 2022-03-01
11 2022-04-01 2 2022-04-01
12 2022-04-15 4 2022-04-01
UPDATED:
Here's another way to do it using the merge_asof() function. It doesn't require the date column to be the index (though it does require that both dataframe arguments be sorted by date):
sessions['date'] = pd.to_datetime(sessions['date'])
evaluations['date'] = pd.to_datetime(evaluations['date'])
evaluations = pd.merge_asof(
    evaluations.sort_values(by=['date']),
    sessions.sort_values(by=['date'])['date'].to_frame().assign(session_evaluated=sessions['date']),
    on='date')
print(evaluations)
Output:
date rating session_evaluated
0 2022-01-05 9 2022-01-01
1 2022-01-10 8 2022-01-01
2 2022-01-15 7 2022-01-01
3 2022-01-20 8 2022-01-01
4 2022-01-25 9 2022-01-01
5 2022-02-01 5 2022-02-01
6 2022-02-05 4 2022-02-01
7 2022-02-28 3 2022-02-01
8 2022-03-01 10 2022-03-01
9 2022-03-15 10 2022-03-01
10 2022-03-31 10 2022-03-01
11 2022-04-01 2 2022-04-01
12 2022-04-15 4 2022-04-01
UPDATE #2:
The call to assign() in the above code can also be written using **kwargs syntax, in case we want a column name with spaces or one that is otherwise not a valid Python identifier (instead of session_evaluated). For example:
evaluations = pd.merge_asof(
    evaluations.sort_values(by=['date']),
    sessions.sort_values(by=['date'])['date'].to_frame()
        .assign(**{'Evaluated Session (Date)': lambda x: sessions['date']}),
    on='date')
Output:
date rating Evaluated Session (Date)
0 2022-01-05 9 2022-01-01
1 2022-01-10 8 2022-01-01
2 2022-01-15 7 2022-01-01
3 2022-01-20 8 2022-01-01
4 2022-01-25 9 2022-01-01
5 2022-02-01 5 2022-02-01
6 2022-02-05 4 2022-02-01
7 2022-02-28 3 2022-02-01
8 2022-03-01 10 2022-03-01
9 2022-03-15 10 2022-03-01
10 2022-03-31 10 2022-03-01
11 2022-04-01 2 2022-04-01
12 2022-04-15 4 2022-04-01

pandas_ta parabolic SAR giving wrong values for yfinance

I made a function that uses the psar function from the pandas_ta library. It seems to work incorrectly: it gives the PSARl, PSARs and PSARr values on the wrong dates. Using an interval of 1 day on BTC-USD, I get the following output:
Used function:
psar = df.ta.psar(high=df['High'], low=df['Low'], close=df['Close'], af0=0.02, af=0.02, max_af=0.2)
print(psar)
Output:
PSARl_0.02_0.2 PSARs_0.02_0.2 PSARaf_0.02_0.2 PSARr_0.02_0.2
Date
2021-10-29 NaN NaN 0.02 0
2021-10-30 NaN 62927.609375 0.02 1
2021-10-31 NaN 62927.609375 0.04 0
2021-11-01 NaN 62813.478125 0.06 0
2021-11-02 59695.183594 NaN 0.02 1
2021-11-03 59695.183594 NaN 0.02 0
2021-11-04 59786.135781 NaN 0.02 0
2021-11-05 59875.268925 NaN 0.02 0
2021-11-06 59962.619406 NaN 0.02 0
2021-11-07 60048.222877 NaN 0.02 0
2021-11-08 60132.114279 NaN 0.04 0
2021-11-09 60433.779395 NaN 0.06 0
2021-11-10 60919.572788 NaN 0.08 0
2021-11-11 61549.176965 NaN 0.08 0
2021-11-12 62128.412808 NaN 0.08 0
2021-11-13 62333.914062 NaN 0.08 0
2021-11-14 62333.914062 NaN 0.08 0
2021-11-15 62850.370938 NaN 0.08 0
2021-11-16 NaN 68789.625000 0.02 1
2021-11-17 NaN 68594.159219 0.04 0
2021-11-18 NaN 68191.009256 0.06 0
2021-11-19 NaN 67492.596279 0.08 0
2021-11-20 NaN 66549.602952 0.08 0
2021-11-21 NaN 65682.049091 0.08 0
2021-11-22 NaN 64883.899538 0.10 0
2021-11-23 NaN 63963.493569 0.12 0
2021-11-24 NaN 62963.805747 0.12 0
2021-11-25 NaN 62084.080463 0.12 0
2021-11-26 NaN 61309.922214 0.14 0
2021-11-27 NaN 60226.300292 0.14 0
2021-11-28 NaN 59294.385438 0.14 0
2021-11-29 NaN 58492.938664 0.14 0
Looking at the yfinance chart for BTC-USD, I should not get a 1 in the PSARr column at 2021-10-30, but I do. It seems random: some values are correct and some are not. What am I doing wrong, or is there something wrong within the function?
Thanks!
Other data:
Open High Low Close \
Date
2021-10-29 60624.871094 62927.609375 60329.964844 62227.964844
2021-10-30 62239.363281 62330.144531 60918.386719 61888.832031
2021-10-31 61850.488281 62406.171875 60074.328125 61318.957031
2021-11-01 61320.449219 62419.003906 59695.183594 61004.406250
2021-11-02 60963.253906 64242.792969 60673.054688 63226.402344
2021-11-03 63254.335938 63516.937500 61184.238281 62970.046875
2021-11-04 62941.804688 63123.289062 60799.664062 61452.230469
2021-11-05 61460.078125 62541.468750 60844.609375 61125.675781
2021-11-06 61068.875000 61590.683594 60163.781250 61527.480469
2021-11-07 61554.921875 63326.988281 61432.488281 63326.988281
2021-11-08 63344.066406 67673.742188 63344.066406 67566.828125
2021-11-09 67549.734375 68530.335938 66382.062500 66971.828125
2021-11-10 66953.335938 68789.625000 63208.113281 64995.230469
2021-11-11 64978.890625 65579.015625 64180.488281 64949.960938
2021-11-12 64863.980469 65460.816406 62333.914062 64155.941406
2021-11-13 64158.121094 64915.675781 63303.734375 64469.527344
2021-11-14 64455.371094 65495.179688 63647.808594 65466.839844
2021-11-15 65521.289062 66281.570312 63548.144531 63557.871094
2021-11-16 63721.195312 63721.195312 59016.335938 60161.246094
2021-11-17 60139.621094 60823.609375 58515.410156 60368.011719
2021-11-18 60360.136719 60948.500000 56550.792969 56942.136719
2021-11-19 56896.128906 58351.113281 55705.179688 58119.578125
2021-11-20 58115.082031 59859.878906 57469.726562 59697.195312
2021-11-21 59730.507812 60004.425781 58618.929688 58730.476562
2021-11-22 58706.847656 59266.359375 55679.839844 56289.289062
2021-11-23 56304.554688 57875.515625 55632.761719 57569.074219
2021-11-24 57565.851562 57803.066406 55964.222656 56280.425781
2021-11-25 57165.417969 59367.968750 57146.683594 57274.679688
2021-11-26 58960.285156 59183.480469 53569.765625 53569.765625
2021-11-27 53736.429688 55329.257812 53668.355469 54815.078125
2021-11-28 54813.023438 57393.843750 53576.734375 57248.457031
2021-11-29 57474.843750 58749.250000 56856.371094 58749.250000
Adj Close Volume
Date
2021-10-29 62227.964844 36856881767
2021-10-30 61888.832031 32157938616
2021-10-31 61318.957031 32241199927
2021-11-01 61004.406250 36150572843
2021-11-02 63226.402344 37746665647
2021-11-03 62970.046875 36124731509
2021-11-04 61452.230469 32615846901
2021-11-05 61125.675781 30605102446
2021-11-06 61527.480469 29094934221
2021-11-07 63326.988281 24726754302
2021-11-08 67566.828125 41125608330
2021-11-09 66971.828125 42357991721
2021-11-10 64995.230469 48730828378
2021-11-11 64949.960938 35880633236
2021-11-12 64155.941406 36084893887
2021-11-13 64469.527344 30474228777
2021-11-14 65466.839844 25122092191
2021-11-15 63557.871094 30558763548
2021-11-16 60161.246094 46844335592
2021-11-17 60368.011719 39178392930
2021-11-18 56942.136719 41388338699
2021-11-19 58119.578125 38702407772
2021-11-20 59697.195312 30624264863
2021-11-21 58730.476562 26123447605
2021-11-22 56289.289062 35036121783
2021-11-23 57569.074219 37485803899
2021-11-24 56280.425781 36635566789
2021-11-25 57274.679688 34284016248
2021-11-26 53569.765625 41810748221
2021-11-27 54815.078125 30560857714
2021-11-28 57248.457031 28116886357
2021-11-29 58749.250000 33326104576
psar = df.ta.psar(high=df['High'], low=df['Low'], close=df['Close'], af0=0.02, af=0.02, max_af=0.2)
print(psar)
# the PSAR reversal column you need (note: the column suffix uses af0, i.e. 0.02, not 0.2)
mypsar = psar["PSARr_0.02_0.2"].iloc[-1]
if mypsar > 0:
    mysartrend = "Bull"
else:
    mysartrend = "Bear"
print(mysartrend)

Pandas: minute based column, need to add 15 seconds at each row

My dataframe looks like this:
1 2019-04-22 00:01:00
2 2019-04-22 00:01:00
3 2019-04-22 00:01:00
4 2019-04-22 00:01:00
5 2019-04-22 00:02:00
6 2019-04-22 00:02:00
7 2019-04-22 00:02:00
8 2019-04-22 00:02:00
9 2019-04-22 00:03:00
10 2019-04-22 00:03:00
11 2019-04-22 00:03:00
12 2019-04-22 00:03:00
As you can see there are four rows for each minute; what I need is to add 15 seconds per successive row within each minute, so that it looks like this:
1 2019-04-22 00:01:00
2 2019-04-22 00:01:15
3 2019-04-22 00:01:30
4 2019-04-22 00:01:45
5 2019-04-22 00:02:00
6 2019-04-22 00:02:15
7 2019-04-22 00:02:30
8 2019-04-22 00:02:45
9 2019-04-22 00:03:00
10 2019-04-22 00:03:15
11 2019-04-22 00:03:30
12 2019-04-22 00:03:45
Any idea on how to proceed? I am not very comfortable with datetime objects, so I am a bit stuck on this one... thank you in advance!
You can add timedeltas to the datetime column:
df['date'] += pd.to_timedelta(df.groupby('date').cumcount() * 15, unit='s')
print (df)
date
1 2019-04-22 00:01:00
2 2019-04-22 00:01:15
3 2019-04-22 00:01:30
4 2019-04-22 00:01:45
5 2019-04-22 00:02:00
6 2019-04-22 00:02:15
7 2019-04-22 00:02:30
8 2019-04-22 00:02:45
9 2019-04-22 00:03:00
10 2019-04-22 00:03:15
11 2019-04-22 00:03:30
12 2019-04-22 00:03:45
Detail:
First, create a counter Series with GroupBy.cumcount:
print (df.groupby('date').cumcount())
1 0
2 1
3 2
4 3
5 0
6 1
7 2
8 3
9 0
10 1
11 2
12 3
dtype: int64
Multiply by 15 and convert to second-based timedeltas with to_timedelta:
print (pd.to_timedelta(df.groupby('date').cumcount() * 15, unit='s'))
1 00:00:00
2 00:00:15
3 00:00:30
4 00:00:45
5 00:00:00
6 00:00:15
7 00:00:30
8 00:00:45
9 00:00:00
10 00:00:15
11 00:00:30
12 00:00:45
dtype: timedelta64[ns]
