Python List - set every n-th value None

As the title says, I want to know how to set every n-th value in a Python list to None. I looked for a solution in a lot of forums but didn't find much.
I also don't want to overwrite existing values with None; instead I want to insert new entries with the value None.
The list contains dates (12 dates = 1 year), and every 13th value should be empty because that row will hold the average, so I don't need a date there.
Here is how I generated the dates with pandas:
import pandas as pd
numdays = 370  # I have 370 values, one value per month, from 1990 until June 2019
date1 = '1990-01-01'
date2 = '2019-06-01'
mydates = pd.date_range(date1, date2,).tolist()
date_all = pd.date_range(start=date1, end=date2, freq='1BMS')
date_lst = [date_all]
The expected Output:
01.01.1990
01.02.1990
01.03.1990
01.04.1990
01.05.1990
01.06.1990
01.07.1990
01.08.1990
01.09.1990
01.10.1990
01.11.1990
01.12.1990
None
01.01.1991
.
.
.

If I understood correctly:
import pandas as pd
numdays = 370
date1 = '1990-01-01'
date2 = '2019-06-01'
mydates = pd.date_range(date1, date2,).tolist()
date_all = pd.date_range(start=date1, end=date2, freq='1BMS')
date_lst = [date_all]
for i in range(12, len(mydates) + len(mydates)//12, 13):  # add this: insert a None after every 12 dates
    mydates.insert(i, None)
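As a quick sanity check on the list built above, the element at index 12 (and every 13th position after that) should now be None:
print(mydates[10:15])        # the middle element (index 12) is None
print(mydates.count(None))   # one None for every 12 dates in the original list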

I saw some of the answers above, but there's a way of doing this without having to loop over the complete list:
date_lst[12::12] = [None] * len(date_lst[12::12])
The first 12 in [12::12] means that the first item to be changed is the one at index 12. The second 12 means that from then on every 12th item is changed.
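To see the slice assignment on a tiny stand-alone list (unrelated to the dates above), every element at index 2, 4, 6, ... is replaced in place:
lst = [1, 2, 3, 4, 5, 6, 7]
lst[2::2] = [None] * len(lst[2::2])
print(lst)   # [1, 2, None, 4, None, 6, None]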

You can add a step in iloc and set values that way.
Let's generate some dummy data:
df = pd.DataFrame({'Vals' :
pd.date_range('01-01-19','02-02-19',freq='D')})
print(df)
Vals
0 2019-01-01
1 2019-01-02
2 2019-01-03
3 2019-01-04
4 2019-01-05
5 2019-01-06
6 2019-01-07
7 2019-01-08
Now you can decide your step:
step = 5
new_df = df.iloc[step::step]
print(new_df)
Vals
5 2019-01-06
10 2019-01-11
15 2019-01-16
20 2019-01-21
25 2019-01-26
30 2019-01-31
Now, if you want to write a value back to a specific column:
df.loc[df.index[step::step], 'Vals'] = pd.NaT
print(df)
Vals
0 2019-01-01
1 2019-01-02
2 2019-01-03
3 2019-01-04
4 2019-01-05
5 NaT

Here is an example that sets None for every element in a 3rd position; you can make it every 13th position by changing the condition to (index+1) % 13 == 0:
data = [1,2,3,4,5,6,7,8,9]
data = [None if ((index+1)%3 == 0) else d for index, d in enumerate(data)]
print(data)
output:
[1, 2, None, 4, 5, None, 7, 8, None]
Applied to your code, try this:
date_lst = list(date_all)
dateWithNone = [None if ((index+1)%13 == 0) else d for index, d in enumerate(date_lst)]
print(dateWithNone)
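If you specifically need to insert new None entries (as the question asks) rather than overwrite existing dates, here is a minimal sketch that rebuilds the list in blocks of 12, assuming date_lst holds the monthly timestamps from date_all:
date_lst = list(date_all)
with_blanks = []
for start in range(0, len(date_lst), 12):
    block = date_lst[start:start + 12]
    with_blanks.extend(block)
    if len(block) == 12:       # only append a blank after a full year
        with_blanks.append(None)
print(with_blanks[:14])        # 12 dates, then None, then the next year starts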

Related

Subtracting times in a csv for a row by row basis in Python

I have a CSV that's a few thousand rows long. It contains data sent from various devices. They should transmit frequently (every 10 minutes), but sometimes there is a lag. I'm trying to write a program that will highlight all instances where the delay between two readings is greater than 15 minutes.
I've written functional code, but with it I first have to manually edit the CSV to change the "eventTime" variable from time format (e.g. 03:22:00) to a float value expressed as a fraction of a day (e.g. 03:22:00 becomes 0.14027). Similarly, the 15-minute interval becomes 0.01042 (15/(60*24)).
import pandas as pd
df = pd.read_csv('file.csv')
df2 = pd.DataFrame()
deviceID = df["deviceId"].unique().tolist()
threshold = 0.01042
for id_no in range(0, len(deviceID)):
    subset = df[df.deviceId == deviceID[id_no]]
    for row in range(len(subset)-1):
        difference = subset.iloc[row, 1] - subset.iloc[row+1, 1]
        if difference > threshold:
            df2 = df2.append(subset.iloc[row])
            df2 = df2.append(subset.iloc[row+1])
df2.to_csv('file2.csv')
This works, and I can open the CSV in Excel and manually change the float values back to time format, but when I might be dealing with a few hundred CSV files this becomes impractical.
I've tried the following:
import pandas as pd
from datetime import datetime
df = pd.read_csv('file.csv')
df2 = pd.DataFrame()
deviceID = df["deviceId"].unique().tolist()
df['eventTime'].apply(lambda x: datetime.strptime(x, "%H:%M:%S"))
threshold = datetime.strptime("00:15:00", '%H:%M:%S')
for id_no in range(0, len(deviceID)):
    subset = df[df.deviceId == deviceID[id_no]]
    for row in range(len(subset)-1):
        difference = datetime.strptime(subset.iloc[row, 1], '%H:%M:%S') - datetime.strptime(subset.iloc[row+1, 1], '%H:%M:%S')
        if difference > threshold:
            df2 = df2.append(subset.iloc[row])
            df2 = df2.append(subset.iloc[row+1])
df2.to_csv('file2.csv')
but I get the following error:
if difference > threshold:
TypeError: '>' not supported between instances of 'datetime.timedelta' and 'datetime.datetime'
The data looks like this:
| eventTime| deviceId|
| -------- | -------- |
| 15:30:00 | 11234889|
| 15:45:00 | 11234889|
| 16:00:00 | 11234889|
and for different IDs
| eventTime| deviceId|
| -------- | -------- |
| 15:30:00 | 11234890|
| 15:45:00 | 11234890|
| 16:00:00 | 11234890|
threshold is a datetime and you compare it to a timedelta object (difference). Did you mean:
from datetime import timedelta
...
threshold = timedelta(minutes=15)
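A minimal illustration of the fix: once both sides are timedelta objects the comparison works (the times here are made up):
from datetime import datetime, timedelta

threshold = timedelta(minutes=15)
difference = datetime.strptime("16:05:00", "%H:%M:%S") - datetime.strptime("15:45:00", "%H:%M:%S")
print(difference > threshold)   # True: a 20-minute gap exceeds the 15-minute threshold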
Given this dataframe:
actual_ts id
0 05:00:00 SPAM
1 5:15:00 SPAM
2 5:33:00 SPAM <-- Should highlight
3 5:45:00 SPAM
4 6:02:00 SPAM <-- Should highlight
5 11:15:00 FOO
6 11:32:00 FOO <-- Should highlight
7 11:45:00 FOO
8 12:08:00 FOO <-- Should highlight
This is a step-by-step way of getting where you want. It's definitely not the most optimal, but it's clear enough to teach you how to avoid looping over dataframes, which is a major no-no. Try running and printing the dataframe at every step so you know what's happening.
# Convert column to timedelta.
df["actual_ts"] = pd.to_timedelta(df["actual_ts"])
# Sort as a best practice if not computationally expensive.
df = df.sort_values(by=["id", "actual_ts"])
# Shift the actual_ts by one row per group.
df["lagged_ts"] = df.groupby(["id"])["actual_ts"].shift(1)
# Fill nulls with same time if you want to avoid NaNs and NaTs.
df["lagged_ts"] = df["lagged_ts"].fillna(df["actual_ts"])
# Calculate difference in seconds.
df["diff_seconds"] = (df["actual_ts"] - df["lagged_ts"]).dt.seconds
# Mark as True all events greater than 15 minutes.
df["highlight"] = df["diff_seconds"] > 900
# Keep all columns you need.
new_df = df[["actual_ts", "id", "diff_seconds", "highlight"]]
You get this:
actual_ts id diff_seconds highlight
5 0 days 11:15:00 FOO 0 False
6 0 days 11:32:00 FOO 1020 True
7 0 days 11:45:00 FOO 780 False
8 0 days 12:08:00 FOO 1380 True
0 0 days 05:00:00 SPAM 0 False
1 0 days 05:15:00 SPAM 900 False
2 0 days 05:33:00 SPAM 1080 True
3 0 days 05:45:00 SPAM 720 False
4 0 days 06:02:00 SPAM 1020 True
Cleaning up the 0 days is up to you. You can also change diff_seconds to minutes but that's easy enough.
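For example, one way (among others) to express the gap in minutes and strip the "0 days" prefix for display, continuing from the frame above:
df["diff_minutes"] = df["diff_seconds"] / 60
df["actual_ts_str"] = df["actual_ts"].astype(str).str.replace("0 days ", "", regex=False)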

Finding the third Friday for an expiration date using pandas datetime

I have a simple definition which finds the third Friday of the month. I use this function to populate the dataframe with the third Fridays, and that part works fine.
The trouble I'm having is finding the third Friday for an expiration_date that doesn't fall on a third Friday.
This is my code simplified:
import pandas as pd

def is_third_friday(d):
    return d.weekday() == 4 and 15 <= d.day <= 21
x = ['09/23/2022','09/26/2022','09/28/2022','09/30/2022','10/3/2022','10/5/2022',
'10/7/2022','10/10/2022','10/12/2022','10/14/2022','10/17/2022','10/19/2022','10/21/2022',
'10/24/2022','10/26/2022','10/28/2022','11/4/2022','11/18/2022','12/16/2022','12/30/2022',
'01/20/2023','03/17/2023','03/31/2023','06/16/2023','06/30/2023','09/15/2023','12/15/2023',
'01/19/2024','06/21/2024','12/20/2024','01/17/2025']
df = pd.DataFrame(x)
df.rename( columns={0 :'expiration_date'}, inplace=True )
df['expiration_date']= pd.to_datetime(df['expiration_date'])
expiration_date = df['expiration_date']
df["is_third_friday"] = [is_third_friday(x) for x in expiration_date]
third_fridays = df.loc[df['is_third_friday'] == True]
df["current_monthly_exp"] = third_fridays['expiration_date'].min()
df["monthly_exp"] = third_fridays[['expiration_date']]
df.to_csv(path_or_buf = f'C:/Data/Date Dataframe.csv',index=False)
What I'm looking for is: for any expiration_date that is prior to the monthly expiry, I want to populate the dataframe with that monthly expiry; if it's past the monthly expiry date, I want to populate the dataframe with the following monthly expiry.
I thought I'd be able to use a new dataframe with only the monthly expiries as a lookup table and do a timedelta, but when you look at 4/21/2023 and 7/21/2023, those dates don't exist in that dataframe.
This is my current output:
This is the output I'm seeking:
I was thinking I could handle this problem with something like:
date_df["monthly_exp"][0][::-1].expanding().min()[::-1]
But it wouldn't solve the 4/21/2023 and 7/21/2023 problem. Additionally, pandas won't let you do this on a datetime dataframe.
>>> df = pd.DataFrame([1, np.nan, 2, np.nan, np.nan, np.nan, 4])
>>> df
0
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 NaN
6 4.0
>>> df["b"] = df[0][::-1].expanding().min()[::-1]
>>> df
0 b
0 1.0 1.0
1 NaN 2.0
2 2.0 2.0
3 NaN 4.0
4 NaN 4.0
5 NaN 4.0
6 4.0 4.0
I've also tried something like the following in many different forms with little luck:
if df['is_third_friday'].any() == True:
    df["monthly_exp"] = third_fridays[['expiration_date']]
else:
    df["monthly_exp"] = third_fridays[['expiration_date']].shift(third_fridays)
Any suggestions to get me in the right direction would be appreciated. I've been stuck on this problem for sometime.
You could add these additional lines of code (to replace df["monthly_exp"] = third_fridays[['expiration_date']]):
# DataFrame of fridays from minimum expiration_date to 30 days after last
fri_3s = pd.DataFrame(
    pd.date_range(df["expiration_date"].min(),
                  df["expiration_date"].max() + pd.tseries.offsets.Day(30),
                  freq="W-FRI"),
    columns=["monthly_exp"])
# only keep those that are between 15th and 21st (as your function did)
fri_3s = fri_3s[fri_3s.monthly_exp.dt.day.between(15, 21)]
# merge_asof to get next third friday
df = pd.merge_asof(df, fri_3s, left_on="expiration_date", right_on="monthly_exp", direction="forward")
This creates a second DataFrame with the 3rd Fridays, and then, by merging with merge_asof, returns the next third Friday on or after each expiration_date.
And to simplify your date_df["monthly_exp"][0][::-1].expanding().min()[::-1] and use it with datetimes, you could instead write df["monthly_exp"].bfill() (which backward fills). As you mentioned, this will only include Fridays that already exist in your DataFrame, so creating a list of the possible Fridays is probably the easiest way.
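As a tiny illustration of the backward-fill idea on a datetime column (made-up values, only to show that bfill does work where the reversed expanding().min() trick does not):
s = pd.Series(pd.to_datetime(["2022-11-18", None, "2022-12-16", None, None, "2023-01-20"]))
print(s.bfill())   # each NaT is replaced by the next known date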

check if each user has consecutive dates in a python 3 pandas dataframe

Imagine there is a dataframe:
id date balance_total transaction_total
0 1 01/01/2019 102.0 -1.0
1 1 01/02/2019 100.0 -2.0
2 1 01/03/2019 100.0 NaN
3 1 01/04/2019 100.0 NaN
4 1 01/05/2019 96.0 -4.0
5 2 01/01/2019 200.0 -2.0
6 2 01/02/2019 100.0 -2.0
7 2 01/04/2019 100.0 NaN
8 2 01/05/2019 96.0 -4.0
here is the create dataframe command:
import pandas as pd
import numpy as np
users=pd.DataFrame(
[
{'id':1,'date':'01/01/2019', 'transaction_total':-1, 'balance_total':102},
{'id':1,'date':'01/02/2019', 'transaction_total':-2, 'balance_total':100},
{'id':1,'date':'01/03/2019', 'transaction_total':np.nan, 'balance_total':100},
{'id':1,'date':'01/04/2019', 'transaction_total':np.nan, 'balance_total':100},
{'id':1,'date':'01/05/2019', 'transaction_total':-4, 'balance_total':np.nan},
{'id':2,'date':'01/01/2019', 'transaction_total':-2, 'balance_total':200},
{'id':2,'date':'01/02/2019', 'transaction_total':-2, 'balance_total':100},
{'id':2,'date':'01/04/2019', 'transaction_total':np.nan, 'balance_total':100},
{'id':2,'date':'01/05/2019', 'transaction_total':-4, 'balance_total':96}
]
)
How could I check if each id has consecutive dates or not? I use the
"shift" idea here but it doesn't seem to work:
Calculating time difference between two rows
df['index_col'] = df.index
for id in df['id'].unique():
    # create an empty QA dataframe
    column_names = ["Delta"]
    df_qa = pd.DataFrame(columns = column_names)
    df_qa['Delta'] = (df['index_col'] - df['index_col'].shift(1))
    if (df_qa['Delta'].iloc[1:] != 1).any() is True:
        print('id ' + id + ' might have non-consecutive dates')
        # doesn't print any account => Each Customer's Daily Balance has Consecutive Dates
        break
Ideal output:
it should print id 2 might have non-consecutive dates
Thank you!
Use groupby and diff:
df["date"] = pd.to_datetime(df["date"],format="%m/%d/%Y")
df["difference"] = df.groupby("id")["date"].diff()
print (df.loc[df["difference"]>pd.Timedelta(1, unit="d")])
#
id date transaction_total balance_total difference
7 2 2019-01-04 NaN 100.0 2 days
Use DataFrameGroupBy.diff with Series.dt.days, compare for greater than 1 with Series.gt, and filter only the id column with DataFrame.loc:
users['date'] = pd.to_datetime(users['date'])
i = users.loc[users.groupby('id')['date'].diff().dt.days.gt(1), 'id'].tolist()
print (i)
[2]
for val in i:
    print(f'id {val} might have non-consecutive dates')
id 2 might have non-consecutive dates
The first step is to parse the date:
users['date'] = pd.to_datetime(users.date)
Then add a shifted column on the id and date columns:
users['id_shifted'] = users.id.shift(1)
users['date_shifted'] = users.date.shift(1)
The difference between date and date_shifted columns is of interest:
>>> users.date - users.date_shifted
0 NaT
1 1 days
2 1 days
3 1 days
4 1 days
5 -4 days
6 1 days
7 2 days
8 1 days
dtype: timedelta64[ns]
You can now query the DataFrame for what you want:
users[(users.id_shifted == users.id) & (users.date - users.date_shifted != np.timedelta64(1, 'D'))]
That is, consecutive lines of the same user with a date difference != 1 day.
This solution does assume the data is sorted by (id, date).
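Putting the pieces together, a sketch assuming users is the frame created above, with date already parsed and the frame sorted by (id, date):
users = users.sort_values(['id', 'date'])
users['id_shifted'] = users.id.shift(1)
users['date_shifted'] = users.date.shift(1)
gaps = users[(users.id_shifted == users.id) &
             (users.date - users.date_shifted != np.timedelta64(1, 'D'))]
for bad_id in gaps['id'].unique():
    print(f'id {bad_id} might have non-consecutive dates')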

How to perform functions that reference previous row on subset of data in a dataframe using groupby

I have some log data that represents an item (id) and the timestamp at which an action was started, and I want to determine the time between actions on each item.
For example, I have some data that looks like this:
data = [{"timestamp":"2019-05-21T14:17:29.265Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T14:21:49.722Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T15:16:25.695Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T15:16:25.696Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-22T07:51:17.49Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T08:11:13.948Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:52:59.897Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:53:03.406Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:53:03.481Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-21T14:23:08.147Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T14:29:18.228Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T15:17:09.831Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T15:17:09.834Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T14:02:19.072Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T14:02:34.867Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T14:12:28.877Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T15:19:19.567Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T15:19:19.582Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T09:58:02.185Z","id":"f89c2e3e-06dc-467b-813b-dc92f2692f63"},{"timestamp":"2019-05-21T10:07:24.044Z","id":"f89c2e3e-06dc-467b-813b-dc92f2692f63"}]
stack = pd.DataFrame(data)
stack.head()
I have tried getting all the unique ids to split the data frame and then getting the time taken with the index to recombine with the original set, but the function is extremely slow on large datasets and messes up both the index and timestamp order, resulting in mismatched results.
import ciso8601 as time
records = []
for i in list(stack.id.unique()):
    dff = stack[stack.id == i]
    time_taken = []
    times = []
    i = 0
    for _, row in dff.iterrows():
        if bool(times):
            print(_)
            current_time = time.parse_datetime(row.timestamp)
            prev_time = times[i]
            time_taken = current_time - prev_time
            times.append(current_time)
            i += 1
            records.append(dict(index = _, time_taken = time_taken.seconds))
        else:
            records.append(dict(index = _, time_taken = 0))
            times.append(time.parse_datetime(row.timestamp))
x = pd.DataFrame(records).set_index('index')
stack.merge(x, left_index=True, right_index=True, how='inner')
Is there a neat pandas groupby-and-apply way of doing this so that I don't have to split the frame and store it in memory, and so that I can reference the previous row within each group?
Thanks
You can use GroupBy.diff:
stack['timestamp'] = pd.to_datetime(stack['timestamp'])
stack['time_taken'] = (stack.sort_values(['id','timestamp'])
                            .groupby('id')
                            .diff()['timestamp']
                            .dt.total_seconds()
                            .round().fillna(0))
print(stack['time_taken'])
0 0.0
1 260.0
2 3276.0
3 0.0
4 0.0
5 1196.0
6 13306.0
7 4.0
8 0.0
9 0.0
10 370.0
11 2872.0
...
If you want the resulting dataframe to be ordered by date, instead do:
stack['timestamp'] = pd.to_datetime(stack['timestamp'])
stack = stack.sort_values(['id','timestamp'])
stack['time_taken'] = (stack.groupby('id')
                            .diff()['timestamp']
                            .dt.total_seconds()
                            .round()
                            .fillna(0))
If you don't need to replace timestamp with datetimes, create a Series of datetimes with to_datetime and pass it to DataFrameGroupBy.diff, then convert to seconds with Series.dt.total_seconds, round with Series.round if necessary, and replace missing values with 0:
t = pd.to_datetime(stack['timestamp'])
stack['time_taken'] = t.groupby(stack['id']).diff().dt.total_seconds().round().fillna(0)
print (stack)
id timestamp time_taken
0 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T14:17:29.265Z 0.0
1 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T14:21:49.722Z 260.0
2 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T15:16:25.695Z 3276.0
3 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T15:16:25.696Z 0.0
4 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T07:51:17.49Z 0.0
5 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T08:11:13.948Z 1196.0
6 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:52:59.897Z 13306.0
7 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:53:03.406Z 4.0
8 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:53:03.481Z 0.0
9 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T14:23:08.147Z 0.0
10 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T14:29:18.228Z 370.0
11 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T15:17:09.831Z 2872.0
12 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T15:17:09.834Z 0.0
13 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:02:19.072Z 0.0
14 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:02:34.867Z 16.0
15 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:12:28.877Z 594.0
16 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T15:19:19.567Z 4011.0
17 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T15:19:19.582Z 0.0
18 f89c2e3e-06dc-467b-813b-dc92f2692f63 2019-05-21T09:58:02.185Z 0.0
19 f89c2e3e-06dc-467b-813b-dc92f2692f63 2019-05-21T10:07:24.044Z 562.0
Or, if you need to replace timestamp with datetimes, use @yatu's answer.
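If you do want to overwrite timestamp with real datetimes, a slightly more direct spelling is to select the column before diffing, so only that column is computed (this should give the same result as the answers above, assuming stack is the frame from the question):
stack['timestamp'] = pd.to_datetime(stack['timestamp'])
stack['time_taken'] = (stack.groupby('id')['timestamp']
                            .diff()
                            .dt.total_seconds()
                            .round()
                            .fillna(0))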

Date Difference based on matching values in two columns - Pandas

I have a dataframe, and I am struggling to create a column based on other columns. I will share the problem with sample data.
Date Target1 Close
0 2018-05-25 198.0090 188.580002
1 2018-05-25 197.6835 188.580002
2 2018-05-25 198.0090 188.580002
3 2018-05-29 196.6230 187.899994
4 2018-05-29 196.9800 187.899994
5 2018-05-30 197.1375 187.500000
6 2018-05-30 196.6965 187.500000
7 2018-05-30 196.8750 187.500000
8 2018-05-31 196.2135 186.869995
9 2018-05-31 196.2135 186.869995
10 2018-05-31 196.5600 186.869995
11 2018-05-31 196.7700 186.869995
12 2018-05-31 196.9275 186.869995
13 2018-05-31 196.2135 186.869995
14 2018-05-31 196.2135 186.869995
15 2018-06-01 197.2845 190.240005
16 2018-06-01 197.2845 190.240005
17 2018-06-04 201.2325 191.830002
18 2018-06-04 201.4740 191.830002
I want to create another column for each observation (called days_to_hit_target, for example) that holds the number of days until Close hits (or crosses) that day's Target1; that difference in days goes into days_to_hit_target.
The idea is: suppose the close price today, on 2018-05-25, is 188.58. I want to find the date at which that day's target (198.0090) is first hit by Close, say somewhere later, on 2018-06-04, where Close has reached the target of the first observation (198.0090); the day difference between those two dates is fed into the first observation of the days_to_hit_target column.
Use a combination of loc and at to find the date at which the target is hit, then subtract the dates.
df['TargetDate'] = pd.NaT
for i, row in df.iterrows():
    t = row['Target1']
    d = row['Date']
    targdf = df.loc[df['Close'] >= t]
    if len(targdf) > 0:
        targdt = targdf['Date'].iloc[0]   # first row whose Close reaches the target
        df.at[i, 'TargetDate'] = targdt
    else:
        df.at[i, 'TargetDate'] = pd.NaT
df['Diff'] = df['TargetDate'].sub(df['Date'], axis=0)
import pandas as pd
csv = pd.read_csv(
    'sample.csv',
    parse_dates=['Date']
)
csv.sort_values('Date', inplace=True)

def find_closest(row):
    target = row['Target1']
    date = row['Date']
    matches = csv[
        (csv['Close'] >= target) &
        (csv['Date'] > date)
    ]
    closest_date = matches['Date'].iloc[0] if not matches.empty else None
    row['days to hit target'] = (closest_date - date).days if closest_date else None
    return row
final = csv.apply(find_closest, axis=1)
It's a bit hard to test because none of the targets appear in the close. But the idea is simple: subset your original frame so that Date is after the current row's date and Close is greater than or equal to Target1, and take the first entry (this is after you've sorted it using df.sort_values).
If the subset is empty, use None; otherwise use the Date. Days to hit target is pretty simple at that point.
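For a single row, the subsetting idea looks like this (a hand-rolled illustration only; row stands for any one row of the sorted frame, and with the sample above the subset is empty because Close never reaches the target):
row = csv.iloc[0]
matches = csv[(csv['Close'] >= row['Target1']) & (csv['Date'] > row['Date'])]
if not matches.empty:
    print((matches['Date'].iloc[0] - row['Date']).days)
else:
    print(None)   # the target is never hit within the data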
