I have a simple function that finds the third Friday of the month. I use it to populate the dataframe with the third Fridays, and that part works fine.
The trouble I'm having is finding the third Friday for an expiration_date that doesn't fall on a third Friday.
This is my code simplified:
import pandas as pd
def is_third_friday(d):
    return d.weekday() == 4 and 15 <= d.day <= 21
x = ['09/23/2022','09/26/2022','09/28/2022','09/30/2022','10/3/2022','10/5/2022',
'10/7/2022','10/10/2022','10/12/2022','10/14/2022','10/17/2022','10/19/2022','10/21/2022',
'10/24/2022','10/26/2022','10/28/2022','11/4/2022','11/18/2022','12/16/2022','12/30/2022',
'01/20/2023','03/17/2023','03/31/2023','06/16/2023','06/30/2023','09/15/2023','12/15/2023',
'01/19/2024','06/21/2024','12/20/2024','01/17/2025']
df = pd.DataFrame(x, columns=['expiration_date'])
df['expiration_date'] = pd.to_datetime(df['expiration_date'])
expiration_date = df['expiration_date']
df["is_third_friday"] = [is_third_friday(x) for x in expiration_date]
third_fridays = df.loc[df['is_third_friday'] == True]
df["current_monthly_exp"] = third_fridays['expiration_date'].min()
df["monthly_exp"] = third_fridays[['expiration_date']]
df.to_csv('C:/Data/Date Dataframe.csv', index=False)
What I'm looking for: for any expiration_date that is prior to the monthly expiry, I want to populate the dataframe with that monthly expiry; if it's past the monthly expiry date, I want to populate the dataframe with the following monthly expiry.
I thought I'd be able to use a new dataframe with only the monthly expiries as a lookup table and do a timedelta, but dates like 4/21/2023 and 7/21/2023 don't exist in that dataframe.
My current output and the output I'm seeking were attached as screenshots (not reproduced here).
I was thinking I could handle this problem with something like:
date_df["monthly_exp"][0][::-1].expanding().min()[::-1]
But it wouldn't solve the 4/21/2023 and 7/21/2023 problem. Additionally, pandas won't let you do this on a datetime column.
>>> df = pd.DataFrame([1, np.nan, 2, np.nan, np.nan, np.nan, 4])
>>> df
0
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 NaN
6 4.0
>>> df["b"] = df[0][::-1].expanding().min()[::-1]
>>> df
0 b
0 1.0 1.0
1 NaN 2.0
2 2.0 2.0
3 NaN 4.0
4 NaN 4.0
5 NaN 4.0
6 4.0 4.0
I've also tried something like the following in many different forms with little luck:
if df['is_third_friday'].any() == True:
    df["monthly_exp"] = third_fridays[['expiration_date']]
else:
    df["monthly_exp"] = third_fridays[['expiration_date']].shift(third_fridays)
Any suggestions to get me pointed in the right direction would be appreciated. I've been stuck on this problem for some time.
You could add these additional lines of code (to replace df["monthly_exp"] = third_fridays[['expiration_date']]):
# DataFrame of fridays from minimum expiration_date to 30 days after last
fri_3s = pd.DataFrame(pd.date_range(df["expiration_date"].min(),
df["expiration_date"].max()+pd.tseries.offsets.Day(30),
freq="W-FRI"),
columns=["monthly_exp"])
# only keep those that are between 15th and 21st (as your function did)
fri_3s = fri_3s[fri_3s.monthly_exp.dt.day.between(15, 21)]
# merge_asof to get next third friday
df = pd.merge_asof(df, fri_3s, left_on="expiration_date", right_on="monthly_exp", direction="forward")
This creates a second DataFrame with the third Fridays, and merging with merge_asof then returns the next one on or after each expiration_date.
And to simplify your date_df["monthly_exp"][0][::-1].expanding().min()[::-1] and make it work for datetimes, you could instead write df["monthly_exp"].bfill() (which backward-fills). As you mentioned, though, that will only pick up Fridays that already exist in your DataFrame, so creating the list of possible Fridays is probably the easiest way.
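For a quick sanity check, here is a sketch (assuming the merge_asof code above has been run on your sample data) that spot-checks the two problem dates from the question:
# Sketch: verify the two dates the lookup-table approach couldn't handle
check = df.loc[df["expiration_date"].isin(
            pd.to_datetime(["2023-03-31", "2023-06-30"])),
        ["expiration_date", "monthly_exp"]]
print(check)
# Expect monthly_exp values of 2023-04-21 and 2023-07-21 respectively,
# even though neither Friday appears in the expiration_date column.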
Using today's year, if the difference from the year in the corresponding column is more than 5, the function should output 1, but I get NaN values instead.
import pandas as pd
from datetime import datetime

today = datetime.today()

def time(x):
    if today.year - x.year > 5:
        x = 1
        return x
    else:
        x = 0
        return x

df['VIP'] = df[condition]['DaysSinceJoined'].apply(time)
df['VIP']
df['VIP']
Instead, I get:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
2235 NaN
2236 NaN
2237 NaN
2238 NaN
2239 NaN
Name: VIP, Length: 2240, dtype: float64
The function works just fine; the issue likely lies in your initial condition.
First, let's generate a bit of sample data:
foo = pd.DataFrame({'time': ['1979-11-10', '1962-07-22', '1987-09-16', '2020-09-16']})

from datetime import datetime
today = datetime.today()

def time(x):
    if today.year - x.year > 5:
        return 1
    else:
        return 0
First we make sure it's not a data format issue:
foo['VIP'] = foo['time'].apply(time)
'str' object has no attribute 'year'
We fix this by converting the dates to datetime:
foo['time'] = pd.to_datetime(foo['time'])
Let's test the function:
foo['VIP'] = foo['time'].apply(time)
time VIP
0 1979-11-10 1
1 1962-07-22 1
2 1987-09-16 1
3 2020-09-16 0
All good.
Now let's apply some random condition:
foo['VIP'] = foo[foo['time'].dt.year > 1980]['time'].apply(time)
time VIP
0 1979-11-10 NaN
1 1962-07-22 NaN
2 1987-09-16 1.0
3 2020-09-16 0.0
The reason is that you first filter your dataframe down to a subset and then feed only those rows to your function. The filtered-out rows are never processed, so they never get return values.
I suggest you do this with the .loc function:
foo.loc[(today.year - foo['time'].dt.year > 5) & (Other_condition_here), 'vip'] = 1
foo.loc[(today.year - foo['time'].dt.year <= 5) & (Other_condition_here), 'vip'] = 0
For more about .loc, see the documentation.
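For instance, with the foo frame from above, a minimal sketch (the year filter merely stands in for Other_condition_here; it is not from the original question):
# Illustrative only: the year > 1980 test stands in for the real condition
cond = foo['time'].dt.year > 1980
foo.loc[(today.year - foo['time'].dt.year > 5) & cond, 'vip'] = 1
foo.loc[(today.year - foo['time'].dt.year <= 5) & cond, 'vip'] = 0
print(foo)  # rows failing cond keep NaN in 'vip'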
I guess when you use .apply it takes several arguments. Use map:
df['VIP'] = df[condition]['DaysSinceJoined'].map(time)
or:
df['VIP'] = df[condition].apply(lambda x: time(x['DaysSinceJoined']), axis=1)
If that doesn't work, show us some sample data.
I'm not 100% sure whether using apply + functools.reduce is the best approach for this problem, nor whether multi-indices could be leveraged to accomplish this goal.
Background
The data is a list of activities performed by accounts.
user_id - the user's ID
activity - the string that represents the activity
performed_at - the timestamp when the activity was completed at
The Goal
To calculate the time spent between 2 statuses. An account can look like this:
user_id  activity     performed_at
1        activated    2020-01-01
1        deactivated  2020-01-02
1        activated    2020-01-03
In this example, the user was deactivated from January 2nd to January 3rd, so the total "Time Deactivated" for this account would be 1 day.
Resulting Dataframe
Here's an example of the output I'm trying to achieve. The time_spent_deactivated column is just the addition of all deactivation periods on all accounts grouped by account.
user_id  time_spent_deactivated
1        24 hours
2        15 hours
3        72 hours
My Attempt
I'm trying to leverage .apply with the .groupby on the user_id, but I'm stuck at the point of calculating the total time spent in the deactivated state:
def calculate_deactivation_time(activities):
    # reduce the given dataframe here
    # this is totally ActiveRecord & JS inspired, but it's the easiest way
    # for me to describe how I expect to solve this
    return activities.reduce(sum, current_row):
        if current_row['activity'] == 'deactivated':
            # find next "activated" activity and compute the delta
            reactivated_row = activities.after(current_row).where(activity, '=', 'activated')
            return sum + (reactivated_row['performed_at'] - current_row['performed_at'])

grouped = activities.groupby('user_id')
grouped.apply(calculate_deactivation_time)
Is there a better approach to doing this? I tried to use functools.reduce to compute the total time spent deactivated, but it doesn't support dataframes out of the box.
I have given it some more thought and I think this is what you are looking for.
import pandas as pd
import numpy as np
def myfunc(x):
df = x
# Shift columns activity_start and performed_at_start
df['performed_at_end'] = df['performed_at_start'].shift(-1)
df['activity_end'] = df['activity_start'].shift(-1)
# Make combinations of activity start-end
df['activity_start_end'] = df['activity_start']+'-'+df['activity_end']
# Take only those that start with deactivated, end with activated
df = df[df['activity_start_end']=='deactivated-activated']
# Drop all that don't have performed_at_end date (does not exist)
df = df[~pd.isna(df['performed_at_end'])]
# Compute time difference in days, then return sum of all delta's
df['delta'] = (df['performed_at_end']-df['performed_at_start'])/np.timedelta64(1,'D')
return df['delta'].sum()
# Example dataframe
df = pd.DataFrame({'UserId': [1]*4+[2]*2+[3]*1,
'activity_start': ['activated', 'deactivated', 'activated', 'deactivated', 'deactivated', 'activated', 'activated'],
'performed_at_start': [pd.Timestamp(2020,1,1), pd.Timestamp(2020,1,2), pd.Timestamp(2020,1,6), pd.Timestamp(2020,1,8),
pd.Timestamp(2020,1,1), pd.Timestamp(2020,1,3), pd.Timestamp(2020,1,1)]})
# Show dataframe
print(df)
UserId activity_start performed_at_start
0 1 activated 2020-01-01
1 1 deactivated 2020-01-02
2 1 activated 2020-01-06
3 1 deactivated 2020-01-08
4 2 deactivated 2020-01-01
5 2 activated 2020-01-03
6 3 activated 2020-01-01
# Compute result
res = (
df.groupby(by='UserId')
.apply(lambda x: myfunc(x)).reset_index(drop=False)
)
res.columns = ['UserId', 'time_spent_deactivated']
# Show result
print(res)
UserId time_spent_deactivated
0 1 4.0
1 2 2.0
2 3 0.0
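If you prefer to avoid the per-group helper, here is a more vectorized sketch of the same pairing logic (column names as in the example above):
# Pair each row with the next row in the same group
nxt_time = df.groupby('UserId')['performed_at_start'].shift(-1)
nxt_act = df.groupby('UserId')['activity_start'].shift(-1)
mask = (df['activity_start'] == 'deactivated') & (nxt_act == 'activated')

# Keep only deactivated->activated deltas, then sum per user (in days)
res = ((nxt_time - df['performed_at_start'])
       .where(mask)
       .dt.days
       .groupby(df['UserId']).sum()
       .rename('time_spent_deactivated')
       .reset_index())
print(res)  # same 4.0 / 2.0 / 0.0 result as above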
Hi, I am using the date difference as a machine learning feature, analyzing how the weight of a patient changed over time.
I successfully tested a method to do that, shown below, but the question is how to extend this to a dataframe where I have to compute the date difference for each patient, as shown in the figure above (the encircled column is what I'm aiming to get). Basically, the baseline date from which the date difference is calculated changes for each new patient name, so that we can track the weight progress over time for that patient. Thanks!
s = '17/6/2016'
s1 = '22/6/16'
a = pd.to_datetime(s, infer_datetime_format=True)
b = pd.to_datetime(s1, infer_datetime_format=True)
e = b.date() - a.date()  # timedelta of 5 days
str(e)                   # '5 days, 0:00:00'
str(e)[0:2]              # '5 '
I think it would be something like this (but I'm not sure how to do this exactly):
def f(row):
    # some logic here
    return val

df['Datediff'] = df.apply(f, axis=1)
You can use transform with 'first':
df['Datediff'] = df['Date'] - df.groupby('Name')['Date'].transform('first')
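A quick sketch on made-up data (the Name/Date column names mirror the snippet above; the values are invented):
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'B', 'B'],
    'Date': pd.to_datetime(['2016-06-17', '2016-06-22', '2016-07-01',
                            '2016-01-01', '2016-01-15']),
})
df['Datediff'] = df['Date'] - df.groupby('Name')['Date'].transform('first')
#   Name       Date Datediff
# 0    A 2016-06-17   0 days
# 1    A 2016-06-22   5 days
# 2    A 2016-07-01  14 days
# 3    B 2016-01-01   0 days
# 4    B 2016-01-15  14 days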
Another solution can be to use cumsum:
df['Datediff'] = df.groupby('Name')['Date'].apply(lambda x:x.diff().cumsum().fillna(0))
df["Datediff"] = df.groupby("Name")["Date"].diff().fillna(0)/ np.timedelta64(1, 'D')
df["Datediff"]
0 0.0
1 12.0
2 14.0
3 66.0
4 23.0
5 0.0
6 10.0
7 15.0
8 14.0
9 0.0
10 14.0
Name: Datediff, dtype: float64
I have a program that ideally measures the temperature every second. In reality, however, this does not always happen: sometimes it skips a second, or it breaks down for 400 seconds and then starts recording again. This leaves gaps in my 2-by-n dataframe, where ideally n = 86400 (the number of seconds in a day). I want to apply some sort of moving/rolling average to get a nicer plot, but if I do that on the "raw" data files, the number of data points shrinks. This is shown here (note the x-axis; plot not reproduced). I know the "nice data" doesn't look nice yet; I'm just playing with some values.
So, I want to implement a data cleaning method, which adds data to the dataframe. I thought about it, but don't know how to implement it. I thought of it as follows:
If the index is not equal to the time, then we need to add a number, at time = index. If this gap is only 1 value, then the average of the previous number and the next number will do for me. But if it is bigger, say 100 seconds are missing, then a linear function needs to be made, which will increase or decrease the value steadily.
So I guess a training set could be like this:
index time temp
0 0 20.10
1 1 20.20
2 2 20.20
3 4 20.10
4 100 22.30
Here, I would like to get a value for index 3, time 3 and the values missing between time = 4 and time = 100. I'm sorry about my formatting skills, I hope it is clear.
How would I go about programming this?
Use merge with a complete time column and then interpolate:
import numpy as np
import pandas as pd

# Create your table
time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']
>>> temps
time temperature
0 4.0 21.662352
1 10.0 20.904659
2 15.0 20.345858
3 18.0 24.787389
4 19.0 20.719487
The above is a random table generated with missing time data.
# modify it
filled = pd.Series(np.arange(temps.iloc[0,0], temps.iloc[-1, 0]+1))
filled = filled.to_frame()
filled.columns = ['time'] # Create a fully filled time column
merged = pd.merge(filled, temps, on='time', how='left') # merge it with original, time without temperature will be null
merged.temperature = merged.temperature.interpolate() # fill nulls linearly.
# Alternatively, use reindex, this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()
>>> merged # or final
time temperature
0 4.0 21.662352
1 5.0 21.536070
2 6.0 21.409788
3 7.0 21.283505
4 8.0 21.157223
5 9.0 21.030941
6 10.0 20.904659
7 11.0 20.792898
8 12.0 20.681138
9 13.0 20.569378
10 14.0 20.457618
11 15.0 20.345858
12 16.0 21.826368
13 17.0 23.306879
14 18.0 24.787389
15 19.0 20.719487
First you can convert the second values to actual timestamps:
df.index = pd.to_datetime(df['time'], unit='s')
After which you can use pandas' built-in time series operations to resample and fill in the missing values:
df = df.resample('s').interpolate('time')
Optionally, if you still want to do some smoothing you can use the following operation for that:
df.rolling(5, center=True, win_type='hann').mean()
Which will smooth with a 5 element wide Hanning window. Note: any window-based smoothing will cost you value points at the edges.
Now your dataframe will have datetimes (including date) as index. This is required for the resample method. If you want to lose the date, you can simply use:
df.index = df.index.time
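Putting the pieces together on the toy data from the question, a minimal sketch (the 'time'/'temp' column names are assumed):
import pandas as pd

# Toy data from the question: gaps at t=3 and t=5..99
df = pd.DataFrame({'time': [0, 1, 2, 4, 100],
                   'temp': [20.10, 20.20, 20.20, 20.10, 22.30]})
df.index = pd.to_datetime(df['time'], unit='s')

# Fill to one-second resolution and interpolate linearly in time
df = df[['temp']].resample('s').interpolate('time')

# Optional smoothing (win_type='hann' requires SciPy)
smoothed = df.rolling(5, center=True, win_type='hann').mean()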
Say that I have a pandas df which contains financial time-series data with datetime index. An example:
x = ['10-06-2016', '10-07-2016', '10-10-2016', '10-11-2016', '10-12-2016']
y = [0,1,2,3,4]
Note that I don't have time-series values on weekends, which is why '10-08-2016' and '10-09-2016' do not appear in the dataframe index.
I wish to create a new y vector which places None in spots where weekends are.
So ideal output:
x = ['10-06-2016', '10-07-2016', '10-08-2016', '10-09-2016', '10-10-2016', '10-11-2016', '10-12-2016']
y = [0,1,None,None,2,3,4]
What's the best way to accomplish this? Since x doesn't contain the weekends, how do I find where they fall and insert None values into y?
You can reindex the data frame, which has a DatetimeIndex, with a wider range of dates as follows; missing values will be filled with NaN:
import pandas as pd
df = pd.DataFrame({'Value': y}, index=pd.to_datetime(x))
# Value
#2016-10-06 0
#2016-10-07 1
#2016-10-10 2
#2016-10-11 3
#2016-10-12 4
df.reindex(pd.date_range(start = df.index.min(), end = df.index.max()))
# Value
#2016-10-06 0.0
#2016-10-07 1.0
#2016-10-08 NaN
#2016-10-09 NaN
#2016-10-10 2.0
#2016-10-11 3.0
#2016-10-12 4.0
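If you then want plain Python lists back (with None where the weekends are), a small sketch on top of the answer above:
full = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
x_full = full.index.strftime('%m-%d-%Y').tolist()
y_full = [None if pd.isna(v) else int(v) for v in full['Value']]
# y_full == [0, 1, None, None, 2, 3, 4]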