Hi, I am using the date difference as a machine learning feature, to analyze how a patient's weight changes over time.
I successfully tested a method to do that, as shown below, but the question is how to extend it to a dataframe where I need the date difference for each patient, as shown in the figure above. The encircled column is what I'm aiming to get. Basically, the baseline date from which the date difference is calculated resets for each new patient name, so that we can track the weight progress over time for that patient. Thanks!
s='17/6/2016'
s1='22/6/16'
a=pd.to_datetime(s,infer_datetime_format=True)
b=pd.to_datetime(s1,infer_datetime_format=True)
e=b.date()-a.date()
str(e)
str(e)[0:2]
I think it would be something like this (but I'm not sure how to do this exactly):
def f(row):
    # some logic here
    return val

df['Datediff'] = df.apply(f, axis=1)
You can use transform with 'first', which broadcasts each patient's first date to all of that patient's rows:
df['Datediff'] = df['Date'] - df.groupby('Name')['Date'].transform('first')
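As a rough end-to-end sketch of this idea: the patient names, dates and weights below are made up for illustration, and the code assumes day-first date strings and that each patient's earliest date is the baseline.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the answer.
df = pd.DataFrame({
    "Name": ["Ann", "Ann", "Ann", "Bob", "Bob"],
    "Date": ["17/06/2016", "22/06/2016", "01/07/2016", "03/01/2017", "10/01/2017"],
    "Weight": [70.0, 69.5, 69.0, 82.0, 81.2],
})

# Parse day-first strings, then sort so "first" really is the earliest visit.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df = df.sort_values(["Name", "Date"]).reset_index(drop=True)

# Broadcast each patient's baseline date to all of that patient's rows.
baseline = df.groupby("Name")["Date"].transform("first")

# Whole days elapsed since the baseline, as plain integers.
df["Datediff"] = (df["Date"] - baseline).dt.days
print(df)
```

Using .dt.days gives an integer feature directly, which avoids slicing the string form of a timedelta.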
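Another solution is a cumulative sum of the row-to-row differences within each patient (transform keeps the result aligned with the original index):
df['Datediff'] = df.groupby('Name')['Date'].transform(lambda x: x.diff().cumsum().fillna(pd.Timedelta(0)))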
# Days since the previous row of the same patient (0 for each patient's first row)
df["Datediff"] = (df.groupby("Name")["Date"].diff() / np.timedelta64(1, 'D')).fillna(0)
df["Datediff"]
0 0.0
1 12.0
2 14.0
3 66.0
4 23.0
5 0.0
6 10.0
7 15.0
8 14.0
9 0.0
10 14.0
Name: Datediff, dtype: float64
I have a simple definition which finds the third friday of the month. I use this function to populate the dataframe for the third fridays and that part works fine.
The trouble I'm having is finding the third friday for an expiration_date that doesn't fall on a third friday.
This is my code simplified:
import pandas as pd
def is_third_friday(d):
    return d.weekday() == 4 and 15 <= d.day <= 21
x = ['09/23/2022','09/26/2022','09/28/2022','09/30/2022','10/3/2022','10/5/2022',
'10/7/2022','10/10/2022','10/12/2022','10/14/2022','10/17/2022','10/19/2022','10/21/2022',
'10/24/2022','10/26/2022','10/28/2022','11/4/2022','11/18/2022','12/16/2022','12/30/2022',
'01/20/2023','03/17/2023','03/31/2023','06/16/2023','06/30/2023','09/15/2023','12/15/2023',
'01/19/2024','06/21/2024','12/20/2024','01/17/2025']
df = pd.DataFrame(x)
df.rename( columns={0 :'expiration_date'}, inplace=True )
df['expiration_date']= pd.to_datetime(df['expiration_date'])
expiration_date = df['expiration_date']
df["is_third_friday"] = [is_third_friday(x) for x in expiration_date]
third_fridays = df.loc[df['is_third_friday'] == True]
df["current_monthly_exp"] = third_fridays['expiration_date'].min()
df["monthly_exp"] = third_fridays[['expiration_date']]
df.to_csv(path_or_buf = f'C:/Data/Date Dataframe.csv',index=False)
What I'm looking for is any expiration_date that is prior to the monthly expire, I want to populate the dataframe with that monthly expire. If it's past the monthly expire date I want to populate the dataframe with the following monthly expire.
I thought I'd be able to use a new dataframe with only the monthly expires as a lookup table and do a timedelta, but when you look at 4/21/2023 and 7/21/2023 these dates don't exist in that dataframe.
This is my current output:
This is the output I'm seeking:
I was thinking I could handle this problem with something like:
date_df["monthly_exp"][0][::-1].expanding().min()[::-1]
But, it wouldn't solve for the 4/21/2023 and 7/21/2023 problem. Additionally, Pandas won't let you do this in a datetime dataframe.
>>> df = pd.DataFrame([1, np.nan, 2, np.nan, np.nan, np.nan, 4])
>>> df
     0
0  1.0
1  NaN
2  2.0
3  NaN
4  NaN
5  NaN
6  4.0
>>> df["b"] = df[0][::-1].expanding().min()[::-1]
>>> df
     0    b
0  1.0  1.0
1  NaN  2.0
2  2.0  2.0
3  NaN  4.0
4  NaN  4.0
5  NaN  4.0
6  4.0  4.0
I've also tried something like the following in many different forms with little luck:
if df['is_third_friday'].any() == True:
    df["monthly_exp"] = third_fridays[['expiration_date']]
else:
    df["monthly_exp"] = third_fridays[['expiration_date']].shift(third_fridays)
Any suggestions to get me in the right direction would be appreciated. I've been stuck on this problem for some time.
You could add these additional lines of code (to replace df["monthly_exp"] = third_fridays[['expiration_date']]):
# DataFrame of Fridays from the minimum expiration_date to 30 days after the last
fri_3s = pd.DataFrame(pd.date_range(df["expiration_date"].min(),
                                    df["expiration_date"].max() + pd.tseries.offsets.Day(30),
                                    freq="W-FRI"),
                      columns=["monthly_exp"])
# only keep those that are between the 15th and 21st (as your function did)
fri_3s = fri_3s[fri_3s.monthly_exp.dt.day.between(15, 21)]
# merge_asof to get the next third Friday
df = pd.merge_asof(df, fri_3s, left_on="expiration_date", right_on="monthly_exp", direction="forward")
This creates a second DataFrame with the 3rd Fridays, and then by merging with merge_asof returns the next of these from the expiration_date.
And to simplify your date_df["monthly_exp"][0][::-1].expanding().min()[::-1] and use it for datetime, you could instead write df["monthly_exp"].bfill() (which backward fills). As you mentioned, this will only include Fridays that exist in your DataFrame already, so creating a list of the possible Fridays might be the easiest way.
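To make the merge_asof step concrete, here is a small self-contained sketch. It uses only four of the expiration dates from the question, and it assumes both key columns are sorted and share the same datetime dtype (the explicit casts are just a precaution).

```python
import pandas as pd

# A few expiration dates from the question, already in ascending order.
df = pd.DataFrame({"expiration_date": pd.to_datetime(
    ["2022-09-23", "2022-10-21", "2023-03-31", "2023-06-30"])})
df["expiration_date"] = df["expiration_date"].astype("datetime64[ns]")

# Every Friday from the first date to 30 days past the last one.
fridays = pd.date_range(df["expiration_date"].min(),
                        df["expiration_date"].max() + pd.Timedelta(days=30),
                        freq="W-FRI")

# A Friday is the third of its month exactly when its day is 15..21.
fri_3s = pd.DataFrame(
    {"monthly_exp": fridays[(fridays.day >= 15) & (fridays.day <= 21)]})
fri_3s["monthly_exp"] = fri_3s["monthly_exp"].astype("datetime64[ns]")

# direction="forward" picks the first third Friday on or after each date.
out = pd.merge_asof(df, fri_3s, left_on="expiration_date",
                    right_on="monthly_exp", direction="forward")
print(out)
```

Because the lookup table is generated from the calendar rather than from the data, dates such as 4/21/2023 and 7/21/2023 are available even though they never appear in expiration_date.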
I am trying to loop through a dataframe creating a dynamic ranges that are limited to the last 6 months of every row index.
Because I am looking back 6 months, I start from the first index row that has a date >= the first date in row index 0 of the dataframe. The condition which I have managed to create is shown below:
for i in df.index:
    if datetime.strptime(df['date'][i], '%Y-%m-%d %H:%M:%S') >= (datetime.strptime(df['date'].iloc[0], '%Y-%m-%d %H:%M:%S') + dateutil.relativedelta.relativedelta(months=6)):
However, this merely creates ranges that keep growing, because they incorporate all data indexed after the first row whose date is >= the date in row index 0 of the dataframe plus 6 months.
How can I limit the condition statement to only the last 6 months of each row index?
I'm not sure what exactly you want to do once you have your "dynamic ranges".
You can obtain a list of intervals (t - 6mo, t) for each t in your DatetimeIndex:
intervals = [(t - pd.DateOffset(months=6), t) for t in df.index]
But doing selection operations in a big for-loop might be slow.
Instead, you might be interested in Pandas's rolling operations. It can even use a date offset (as long as it is fixed-frequency) instead of a fixed-sized int window width. However, "6 months" is a non-fixed frequency, and as such the regular rolling won't accept it.
Still, if you are ok with an approximation, say "182 days", then the following might work well.
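import numpy as np
import pandas as pd

# setup ('M' = month end; spelled 'ME' in pandas >= 2.2)
n = 10
df = pd.DataFrame(
    {'a': np.arange(n), 'b': np.ones(n)},
    index=pd.date_range('2019-01-01', freq='M', periods=n))
# example: sum over the trailing 182 days of each row
df.rolling('182D', min_periods=0).sum()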
# out:
a b
2019-01-31 0.0 1.0
2019-02-28 1.0 2.0
2019-03-31 3.0 3.0
2019-04-30 6.0 4.0
2019-05-31 10.0 5.0
2019-06-30 15.0 6.0
2019-07-31 21.0 7.0
2019-08-31 27.0 6.0
2019-09-30 33.0 6.0
2019-10-31 39.0 6.0
If you want to be strict on the 6 months windows, you can implement your own pandas.api.indexers.BaseIndexer and use that as arg of rolling.
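If performance is not critical, a strict calendar window can also be sketched with a plain loop instead of a custom indexer. This is only an illustration: the helper name strict_six_month_sum and the month-start sample index are assumptions, and the window is taken as (t - 6 months, t].

```python
import numpy as np
import pandas as pd

n = 10
df = pd.DataFrame(
    {"a": np.arange(n), "b": np.ones(n)},
    index=pd.date_range("2019-01-01", periods=n, freq="MS"))  # month starts


def strict_six_month_sum(frame):
    """Sum each column over the exact window (t - 6 months, t] for every row t."""
    rows = []
    for t in frame.index:
        start = t - pd.DateOffset(months=6)  # calendar-aware, not 182 days
        window = frame.loc[(frame.index > start) & (frame.index <= t)]
        rows.append(window.sum())
    return pd.DataFrame(rows, index=frame.index)


result = strict_six_month_sum(df)
print(result)
```

The loop is O(n) selections, so for large frames the approximate '182D' rolling window or a BaseIndexer subclass is likely to be faster.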
I have a program that ideally measures the temperature every second. However, in reality this does not happen. Sometimes, it skips a second or it breaks down for 400 seconds and then decides to start recording again. This leaves gaps in my 2-by-n dataframe, where ideally n = 86400 (the amount of seconds in a day). I want to apply some sort of moving/rolling average to it to get a nicer plot, but if I do that to the "raw" datafiles, the amount of data points becomes less. This is shown here, watch the x-axis. I know the "nice data" doesn't look nice yet; I'm just playing with some values.
So, I want to implement a data cleaning method, which adds data to the dataframe. I thought about it, but don't know how to implement it. I thought of it as follows:
If the index is not equal to the time, then we need to add a number, at time = index. If this gap is only 1 value, then the average of the previous number and the next number will do for me. But if it is bigger, say 100 seconds are missing, then a linear function needs to be made, which will increase or decrease the value steadily.
So I guess a training set could be like this:
index time temp
0 0 20.10
1 1 20.20
2 2 20.20
3 4 20.10
4 100 22.30
Here, I would like to get a value for index 3, time 3 and the values missing between time = 4 and time = 100. I'm sorry about my formatting skills, I hope it is clear.
How would I go about programming this?
Use merge with complete time column and then interpolate:
import numpy as np
import pandas as pd

# Create your table
time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']
>>> temps
time temperature
0 4.0 21.662352
1 10.0 20.904659
2 15.0 20.345858
3 18.0 24.787389
4 19.0 20.719487
The above is a random table generated with missing time data.
# modify it
filled = pd.Series(np.arange(temps.iloc[0,0], temps.iloc[-1, 0]+1))
filled = filled.to_frame()
filled.columns = ['time'] # Create a fully filled time column
merged = pd.merge(filled, temps, on='time', how='left') # merge it with original, time without temperature will be null
merged.temperature = merged.temperature.interpolate() # fill nulls linearly.
# Alternatively, use reindex, this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()
>>> merged # or final
time temperature
0 4.0 21.662352
1 5.0 21.536070
2 6.0 21.409788
3 7.0 21.283505
4 8.0 21.157223
5 9.0 21.030941
6 10.0 20.904659
7 11.0 20.792898
8 12.0 20.681138
9 13.0 20.569378
10 14.0 20.457618
11 15.0 20.345858
12 16.0 21.826368
13 17.0 23.306879
14 18.0 24.787389
15 19.0 20.719487
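For a reproducible check of the reindex-and-interpolate variant, the sketch below uses a small fixed table loosely based on the one in the question (the last row is shortened to time 7 to keep the output small; the values are illustrative).

```python
import numpy as np
import pandas as pd

# Fixed sample with gaps at time 3 and times 5-6.
temps = pd.DataFrame({"time": [0, 1, 2, 4, 7],
                      "temp": [20.10, 20.20, 20.20, 20.10, 20.40]})

# Complete time axis, one row per second.
full_time = pd.Index(
    np.arange(temps["time"].min(), temps["time"].max() + 1), name="time")

# Reindexing inserts NaN rows for the missing seconds.
filled = temps.set_index("time").reindex(full_time)

# Linear interpolation: a single gap becomes the mean of its neighbours,
# a longer gap becomes a straight line between the surrounding readings.
filled["temp"] = filled["temp"].interpolate()
filled = filled.reset_index()
print(filled)
```

After the reindex every second is present, so the default linear interpolation over row positions coincides with interpolation over time.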
First you can set the second values to actual time values as such:
df.index = pd.to_datetime(df['time'], unit='s')
After which you can use pandas' built-in time series operations to resample and fill in the missing values:
df = df.resample('s').interpolate('time')
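Put together on a tiny made-up sample, these two steps might look as follows; the column names time and temp are assumed from the question, and only the temperature column is resampled.

```python
import pandas as pd

# Made-up readings with missing seconds (3 and 5-7).
df = pd.DataFrame({"time": [0, 1, 2, 4, 8],
                   "temp": [20.1, 20.2, 20.2, 20.1, 22.1]})

# Interpret the second counter as timestamps (counted from 1970-01-01).
df.index = pd.to_datetime(df["time"], unit="s")

# One row per second; gaps are filled linearly with respect to time.
filled = df[["temp"]].resample("s").interpolate("time")
print(filled)
```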
Optionally, if you still want to do some smoothing you can use the following operation for that:
df.rolling(5, center=True, win_type='hann').mean()
This smooths with a 5-element-wide Hann window (passing win_type requires SciPy to be installed). Note: any window-based smoothing will cost you data points at the edges.
Now your dataframe will have datetimes (including date) as index. This is required for the resample method. If you want to lose the date, you can simply use:
df.index = df.index.time
I have a dataframe, sega_df:
Month 2016-11-01 2016-12-01
Character
Sonic 12.0 3.0
Shadow 5.0 23.0
I would like to create multiple new columns, by applying a formula for each already existing column within my dataframe (to put it shortly, pretty much double the number of columns). That formula is (100 - [5*eachcell])*0.2.
For example, for November for Sonic, (100-[5*12.0])*0.2 = 8.0, and December for Sonic, (100-[5*3.0])*0.2 = 17.0 My ideal output is:
Month 2016-11-01 2016-12-01 Weighted_2016-11-01 Weighted_2016-12-01
Character
Sonic 12.0 3.0 8.0 17.0
Shadow 5.0 23.0 15.0 -3.0
I know how to create a for loop to create one column. This is for if only one month was in consideration:
for w in range(1,len(sega_df.index)):
    sega_df['Weighted'] = (100 - 5*sega_df)*0.2
    sega_df[sega_df < 0] = 0
I haven't gotten the skills or experience yet to create multiple columns. I've looked for other questions that may answer what exactly I am doing but haven't gotten anything to work yet. Thanks in advance.
One vectorised approach is to drop down to NumPy:
A = sega_df.values
A = (100 - 5*A) * 0.2
# astype(str) makes the prefixing work whether the column labels are strings or dates
res = pd.DataFrame(A, index=sega_df.index, columns='Weighted_' + sega_df.columns.astype(str))
Then join the result to your original dataframe:
sega_df = sega_df.join(res)
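A pandas-only variant of the same idea, as a sketch: it assumes the month labels are plain strings, rebuilds the sample frame from the question, and uses add_prefix instead of constructing the new column names by hand.

```python
import pandas as pd

# Sample frame from the question (month labels assumed to be strings).
sega_df = pd.DataFrame(
    {"2016-11-01": [12.0, 5.0], "2016-12-01": [3.0, 23.0]},
    index=pd.Index(["Sonic", "Shadow"], name="Character"))

# Apply the formula to every cell at once, then rename the columns.
weighted = ((100 - 5 * sega_df) * 0.2).add_prefix("Weighted_")

# Attach the new columns next to the originals.
result = sega_df.join(weighted)
print(result)
```

If negative weights should be floored at zero, as in the loop from the question, weighted.clip(lower=0) can be applied before the join.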
Note: Contrived example. Please don't hate on forecasting and I don't need advice on it. This is strictly a Pandas how-to question.
Example - One Solution
I have two different sized DataFrames, one representing sales and one representing a forecast.
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
The forecast needs to be with the latest sales, which is at the end of the list of sales numbers [5, 6, 7, 5]. Other times, I might want it at other locations (please don't ask why, I just need it this way).
This works:
df = pd.concat([sales, forecast], ignore_index=True, axis=1)
df.columns = ['sales', 'forecast'] # Not necessary, making next command pretty
df.forecast = df.forecast.shift(len(sales) - len(forecast))
This gives me the desired outcome:
Question
What I want to know is: Can I concatenate to the end of the sales data without performing the additional shift (the last command)? I'd like to do this in one step instead of two. concat or something similar is fine, but I'd like to skip the shift.
I'm not hung up on having two lines of code. That's okay. I want a solution with the maximum possible performance. My application is sensitive to every millisecond we throw at it on account of huge volumes.
Not sure if that is much faster but you could do
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
forecast.index = sales.index[-forecast.shape[0]:]
which gives
forecast
6 5.0
7 5.5
8 6.0
9 5.0
and then simply
pd.concat([sales, forecast], axis=1)
yielding the desired outcome:
sales forecast
0 5 NaN
1 3 NaN
2 5 NaN
3 6 NaN
4 4 NaN
5 4 NaN
6 5 5.0
7 6 5.5
8 7 6.0
9 5 5.0
A one-line solution using the same idea, as mentioned by @Dark in the comments, would be (set_axis returns a new object by default; the inplace argument was removed in pandas 2.0):
pd.concat([sales, forecast.set_axis(sales.index[-len(forecast):], axis=0)], axis=1)
giving the same output.
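For reference, a self-contained version of the index-alignment approach, using the sample data from the question. It is a sketch of the idea only and makes no claim about speed; timing on real data would be needed for that.

```python
import pandas as pd

sales = pd.DataFrame({"sales": [5, 3, 5, 6, 4, 4, 5, 6, 7, 5]})
forecast = pd.DataFrame({"forecast": [5, 5.5, 6, 5]})

# Give the forecast the index labels of the last len(forecast) sales rows.
aligned = forecast.set_axis(sales.index[-len(forecast):], axis=0)

# concat aligns on the index, so no shift is needed afterwards.
df = pd.concat([sales, aligned], axis=1)
print(df)
```

To place the forecast elsewhere, slice a different run of labels from sales.index, for example sales.index[2:2 + len(forecast)].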