I have a dataframe:
A B C
date
2021-01-01 1 nan 1
2021-01-23 nan 1 1
2021-02-03 1 nan 1
How can I add 1 to all columns at the beginning of each month? (I also want to do this quarterly.) The dataframe should end up looking like this:
A B C
date
2021-01-01 2 nan 2
2021-01-23 nan 1 1
2021-02-01 nan 1 1
2021-02-03 1 nan 1
The beginning of the month should have "nan" in the same place as the last instance of the previous month.
IIUC the logic, you could do:
# ensure datetime
df.index = pd.to_datetime(df.index)
# fill missing starts of month
idx = pd.date_range(df.index.min(), df.index.max(), freq='MS')
df = df.reindex(df.index.union(idx))
# update starts of month
prev = df.shift(1).loc[idx] # get last data of previous month
df.loc[idx] = df.loc[idx].add(1).combine_first(prev) # increment/fill
output:
A B C
2021-01-01 2.0 NaN 2.0
2021-01-23 NaN 1.0 1.0
2021-02-01 NaN 1.0 1.0
2021-02-03 1.0 NaN 1.0
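The same steps can be wrapped in a small helper so that the monthly and quarterly cases differ only in the frequency alias. This is a sketch under the assumptions of the answer above; the function name `bump_period_starts` is made up for illustration, and `'QS'` (quarter start) is assumed to be the alias wanted for the quarterly case.

```python
import numpy as np
import pandas as pd


def bump_period_starts(df, freq="MS"):
    """Insert missing period starts and add 1 to each period-start row.

    Hypothetical helper: freq="MS" means month start, freq="QS" quarter start.
    """
    out = df.copy()
    out.index = pd.to_datetime(out.index)
    # every period start between the first and last date in the data
    idx = pd.date_range(out.index.min(), out.index.max(), freq=freq)
    # add the period starts that are not already present (new rows are NaN)
    out = out.reindex(out.index.union(idx))
    # row immediately before each period start
    prev = out.shift(1).loc[idx]
    # increment existing values, fill the remaining NaN from the previous row
    out.loc[idx] = out.loc[idx].add(1).combine_first(prev)
    return out


df = pd.DataFrame(
    {"A": [1, np.nan, 1], "B": [np.nan, 1, np.nan], "C": [1.0, 1.0, 1.0]},
    index=pd.to_datetime(["2021-01-01", "2021-01-23", "2021-02-03"]),
)

monthly = bump_period_starts(df, freq="MS")
quarterly = bump_period_starts(df, freq="QS")
print(monthly)
print(quarterly)
```

With the sample data the quarterly call only touches 2021-01-01, because that is the only quarter start inside the date range.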
df[(df.index.is_month_end) & (df.index >= df.first_valid_index())]+=1
Figured it out. And, for quarterly, it would be .is_quarter_end
I have this data frame
import pandas as pd
df = pd.DataFrame({'Found': ['A','A','A','A','A','B','B','B','B'],
                   'Date': ['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020','20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
                   'Mark': [1,2,3,4,5,1,2,3,3]
                   })
print(df)
Based on this data frame, I want the Mark from the previous year. I managed to get the maximum Mark per group with .max(), but I want the last one instead. I thought I could get it with .last(), but it didn't work.
Here is my code:
df['Date'] = pd.to_datetime(df['Date'])
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
s1 = df.groupby(['Found', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['Found', 'LastYear'])
print (df)
Found Date Mark LastYear Max_MarkLastYear
0 A 2021-10-14 1 2021-12-31 5.0
1 A 2020-10-19 2 2020-12-31 3.0
2 A 2019-10-29 3 2019-12-31 NaN
3 A 2021-09-30 4 2021-12-31 5.0
4 A 2020-09-20 5 2020-12-31 3.0
5 B 2021-10-20 1 2021-12-31 3.0
6 B 2020-10-29 2 2020-12-31 3.0
7 B 2019-10-15 3 2019-12-31 NaN
8 B 2020-10-09 3 2020-12-31 3.0
How do I create a new column with the last value of the previous year?
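One possible approach, sketched here under assumptions that are not confirmed in the question: "last" is taken to mean the chronologically latest row within each group and year, the dates are day-first, and the helper column `Year` and the result column `Last_MarkLastYear` are names invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Found": ["A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "Date": ["14/10/2021", "19/10/2020", "29/10/2019", "30/09/2021", "20/09/2020",
             "20/10/2021", "29/10/2020", "15/10/2019", "10/09/2020"],
    "Mark": [1, 2, 3, 4, 5, 1, 2, 3, 3],
})

# assumption: the strings are day/month/year
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["Year"] = df["Date"].dt.year

# latest Mark per (Found, Year): sort by date so that .last() is chronological
last = (df.sort_values("Date")
          .groupby(["Found", "Year"], as_index=False)["Mark"]
          .last()
          .rename(columns={"Mark": "Last_MarkLastYear"}))

# shift the key forward one year so each row joins to the previous year's value
last["Year"] += 1

out = df.merge(last, on=["Found", "Year"], how="left")
print(out)
```

Sorting before grouping matters here: `.last()` returns the last row in the current row order, not the latest date.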
I would like to compare a column of one df with a column in a different df. The columns are timestamp and holiday date. I'd like to create a dummy variable that is 1 if the timestamp in df1 matches one of the dates in df2, and 0 otherwise.
For example, df1:
timestamp weight(kg)
0 2016-03-04 4.0
1 2015-02-15 5.0
2 2019-05-04 5.0
3 2018-12-25 29.0
4 2020-01-01 58.0
For example, df2:
holiday
0 2016-12-25
1 2017-01-01
2 2019-05-01
3 2018-12-26
4 2020-05-26
Ideal output:
timestamp weight(kg) holiday
0 2016-03-04 4.0 0
1 2015-02-15 5.0 0
2 2019-05-04 5.0 0
3 2018-12-25 29.0 1
4 2020-01-01 58.0 1
I have tried writing a function but it is taking very long to calculate:
def add_holiday(x):
    hols_df = hols.apply(lambda y: y['holiday_dt'] if
                         x['timestamp'] == y['holiday_dt']
                         else None, axis=1)
    hols_df = hols_df.dropna(axis=0, how='all')
    if hols_df.empty:
        hols_df = np.nan
    else:
        hols_df = hols_df.to_string(index=False)
    return hols_df
#df_hols['holidays'] = df_hols.apply(add_holiday, axis=1)
Perhaps, there is a simpler way to do so or the function is not exactly well-written. Any help will be appreciated.
Use Series.isin and convert the boolean mask to 1/0 with Series.astype:
df1['holiday'] = df1['timestamp'].isin(df2['holiday']).astype(int)
Or with numpy.where:
df1['holiday'] = np.where(df1['timestamp'].isin(df2['holiday']), 1, 0)
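For a quick check, here is a self-contained sketch of the `isin` approach. The sample values are invented for illustration (they are not the data from the question), and the `dt.normalize()` step is an assumption that the timestamps may carry a time of day which should be ignored when matching against holiday dates.

```python
import pandas as pd

# invented sample data, with times of day on the timestamps
df1 = pd.DataFrame({
    "timestamp": ["2016-03-04 08:30:00", "2018-12-25 14:00:00", "2020-01-01 00:00:00"],
    "weight(kg)": [4.0, 29.0, 58.0],
})
df2 = pd.DataFrame({"holiday": ["2018-12-25", "2020-01-01", "2019-05-01"]})

# both sides must be real datetimes, otherwise isin compares strings to dates
df1["timestamp"] = pd.to_datetime(df1["timestamp"])
df2["holiday"] = pd.to_datetime(df2["holiday"])

# drop the time of day, test membership, then turn True/False into 1/0
df1["holiday"] = df1["timestamp"].dt.normalize().isin(df2["holiday"]).astype(int)
print(df1)
```

Because `isin` is vectorized, it avoids the row-by-row `apply` that made the original function slow.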
This is my data:
df = pd.DataFrame([
    {'start_date': '2019/12/01', 'end_date': '2019/12/05', 'spend': 10000, 'campaign_id': 1},
    {'start_date': '2019/12/05', 'end_date': '2019/12/09', 'spend': 50000, 'campaign_id': 2},
    {'start_date': '2019/12/01', 'end_date': '', 'spend': 10000, 'campaign_id': 3},
    {'start_date': '2019/12/01', 'end_date': '2019/12/01', 'spend': 50, 'campaign_id': 4},
])
I need to add a column to each row for each day since 2019/12/01, and calculate the spend on that campaign that day, which I'll get by dividing the spend on the campaign by the total number of days it was active.
So here I'd add a column for each day between 1 December and today (10 December). For row 1, the content of the five columns for 1 Dec to 5 Dec would be 2000, then for the five columns from 6 Dec to 10 Dec it would be zero.
I know pandas is well-designed for this kind of problem, but I have no idea where to start!
This doesn't seem like a straightforward task to me. But first convert your date columns if you haven't already:
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])
Then create a helper function for resampling:
def resampler(data, daterange):
    temp = (data.set_index('start_date').groupby('campaign_id')
            .apply(daterange)
            .drop("campaign_id", axis=1)
            .reset_index().rename(columns={"level_1": "start_date"}))
    return temp
Now it's a three-step process. First resample your data according to the end_date of each group:
df1 = resampler(df, lambda d: d.reindex(pd.date_range(min(d.index),max(d["end_date"]),freq="D")) if d["end_date"].notnull().all() else d)
df1["spend"] = df1.groupby("campaign_id")["spend"].transform(lambda x: x.mean()/len(x))
With the average values calculated, resample again to current date:
dates = pd.date_range(min(df["start_date"]),pd.Timestamp.today(),freq="D")
df1 = resampler(df1,lambda d: d.reindex(dates))
Finally transpose your dataframe:
df1 = pd.concat([df1.drop("end_date",axis=1).set_index(["campaign_id","start_date"]).unstack(),
df1.groupby("campaign_id")["end_date"].min()], axis=1)
df1.columns = [*dates,"end_date"]
print (df1)
#
2019-12-01 00:00:00 2019-12-02 00:00:00 2019-12-03 00:00:00 2019-12-04 00:00:00 2019-12-05 00:00:00 2019-12-06 00:00:00 2019-12-07 00:00:00 2019-12-08 00:00:00 2019-12-09 00:00:00 2019-12-10 00:00:00 end_date
campaign_id
1 2000.0 2000.0 2000.0 2000.0 2000.0 NaN NaN NaN NaN NaN 2019-12-05
2 NaN NaN NaN NaN 10000.0 10000.0 10000.0 10000.0 10000.0 NaN 2019-12-09
3 10000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT
4 50.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2019-12-01
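As a cross-check, the same table can be built more directly with one boolean mask per campaign. This is an alternative sketch, not the method used in the answer, and it makes a few assumptions: "today" is pinned to 2019-12-10 so the result is reproducible, a campaign without an end_date is treated as a single-day campaign (which mirrors the output above), and inactive days are filled with 0 as the question asked, rather than NaN.

```python
import pandas as pd

df = pd.DataFrame([
    {"start_date": "2019/12/01", "end_date": "2019/12/05", "spend": 10000, "campaign_id": 1},
    {"start_date": "2019/12/05", "end_date": "2019/12/09", "spend": 50000, "campaign_id": 2},
    {"start_date": "2019/12/01", "end_date": "", "spend": 10000, "campaign_id": 3},
    {"start_date": "2019/12/01", "end_date": "2019/12/01", "spend": 50, "campaign_id": 4},
])
df["start_date"] = pd.to_datetime(df["start_date"], format="%Y/%m/%d")
# the empty end_date cannot be parsed, so coerce it to NaT
df["end_date"] = pd.to_datetime(df["end_date"], format="%Y/%m/%d", errors="coerce")

today = pd.Timestamp("2019-12-10")  # assumption: fixed instead of Timestamp.today()
days = pd.date_range(df["start_date"].min(), today, freq="D")

# assumption: a missing end_date means the campaign ran for its start day only
end = df["end_date"].fillna(df["start_date"])
n_days = (end - df["start_date"]).dt.days + 1  # inclusive day count
per_day = df["spend"] / n_days

rows = {}
for cid, start, stop, value in zip(df["campaign_id"], df["start_date"], end, per_day):
    active = (days >= start) & (days <= stop)
    # daily spend on active days, 0 elsewhere
    rows[cid] = pd.Series(value, index=days).where(active, 0.0)

wide = pd.DataFrame(rows).T  # one row per campaign, one column per day
wide.index.name = "campaign_id"
print(wide)
```

The loop is over campaigns rather than days, so it stays cheap as long as the number of campaigns is moderate.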
I have a dataframe full of bookings for one room (rows: booking_id, check-in date and check-out date) that I want to transform into a timeseries indexed by every day of the year (index: days of year, feature: booked or not).
I have calculated the duration of the bookings, and reindexed the dataframe daily.
Now I need to forward-fill the dataframe, but only a limited number of times: the duration of each booking.
Tried iterating through each row with ffill but it applies to the entire dataframe, not to selected rows.
Any idea how I can do that?
Here is my code:
import numpy as np
import pandas as pd
#create dataframe
data=[[1, '2019-01-01', '2019-01-02', 1],
[2, '2019-01-03', '2019-01-07', 4],
[3, '2019-01-10','2019-01-13', 3]]
df = pd.DataFrame(data, columns=['booking_id', 'check-in', 'check-out', 'duration'])
#cast dates to datetime formats
df['check-in'] = pd.to_datetime(df['check-in'])
df['check-out'] = pd.to_datetime(df['check-out'])
#create timeseries indexed on check-in date
df2 = df.set_index('check-in')
#create new index and reindex timeseries
idx = pd.date_range(min(df['check-in']), max(df['check-out']), freq='D')
ts = df2.reindex(idx)
I have this:
booking_id check-out duration
2019-01-01 1.0 2019-01-02 1.0
2019-01-02 NaN NaT NaN
2019-01-03 2.0 2019-01-07 4.0
2019-01-04 NaN NaT NaN
2019-01-05 NaN NaT NaN
2019-01-06 NaN NaT NaN
2019-01-07 NaN NaT NaN
2019-01-08 NaN NaT NaN
2019-01-09 NaN NaT NaN
2019-01-10 3.0 2019-01-13 3.0
2019-01-11 NaN NaT NaN
2019-01-12 NaN NaT NaN
2019-01-13 NaN NaT NaN
I expect to have:
booking_id check-out duration
2019-01-01 1.0 2019-01-02 1.0
2019-01-02 1.0 2019-01-02 1.0
2019-01-03 2.0 2019-01-07 4.0
2019-01-04 2.0 2019-01-07 4.0
2019-01-05 2.0 2019-01-07 4.0
2019-01-06 2.0 2019-01-07 4.0
2019-01-07 NaN NaT NaN
2019-01-08 NaN NaT NaN
2019-01-09 NaN NaT NaN
2019-01-10 3.0 2019-01-13 3.0
2019-01-11 3.0 2019-01-13 3.0
2019-01-12 3.0 2019-01-13 3.0
2019-01-13 NaN NaT NaN
filluntil = ts['check-out'].ffill()
m = ts.index < filluntil.values
#reshaping the mask to be the same shape as ts
m = np.repeat(m, ts.shape[1]).reshape(ts.shape)
ts = ts.ffill().where(m)
First we create a series where the dates are ffilled. Then we create a mask where the index is less than the filled values. Then we fill based on our mask.
If you want to include the row with the check out date, change m from < to <=
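Put together with the data from the question, the mask approach can be run end to end as below. This is a sketch of the same technique; the variable names `fill_until`, `mask_2d`, `result` and `booked` are introduced here for illustration, and `np.tile` is used in place of `np.repeat(...).reshape(...)` to build the two-dimensional mask.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, "2019-01-01", "2019-01-02", 1],
     [2, "2019-01-03", "2019-01-07", 4],
     [3, "2019-01-10", "2019-01-13", 3]],
    columns=["booking_id", "check-in", "check-out", "duration"],
)
df["check-in"] = pd.to_datetime(df["check-in"])
df["check-out"] = pd.to_datetime(df["check-out"])

# daily index from the first check-in to the last check-out
idx = pd.date_range(df["check-in"].min(), df["check-out"].max(), freq="D")
ts = df.set_index("check-in").reindex(idx)

# for every day, the check-out date of the most recent booking
fill_until = ts["check-out"].ffill()
# a day counts as booked while it is strictly before that check-out date
mask = ts.index < fill_until.values
# repeat the row mask across all columns so it matches the shape of ts
mask_2d = np.tile(mask[:, None], (1, ts.shape[1]))

result = ts.ffill().where(mask_2d)
# 1/0 "booked or not" feature derived from the filled frame
booked = result["booking_id"].notna().astype(int)
print(result)
```

With the strict `<` comparison, the check-out day itself is left empty, so 2019-01-02, 2019-01-07 and 2019-01-13 are not marked as booked.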
To forward-fill the dataframe, I think you should use the pandas interpolate method. The documentation can be found here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html
You can do something like this:
int_how_many_consecutive_to_fill = 3
df2 = df2.interpolate(axis=0, limit=int_how_many_consecutive_to_fill, limit_direction='forward')
Look at the documentation for interpolate; the method takes many optional arguments that let you customize its behavior.
EDIT:
To do this using the duration value of each row as the limit for each interpolation, the following is a bit messy, but I think it should work (there may be a cleaner, less hacky solution using pandas or another library that I am unaware of):
# rows where every column is NaN:
nans_df = df2[df2.isnull().all(axis=1)]
# rows with at least one non-NaN value:
non_nans_df = df2[~df2.isnull().all(axis=1)]
# list of dfs we will concat vertically at the end to get the final dataframe.
dfs = []
# iterate through each all-NaN row.
for nan_index, nan_row in nans_df.iterrows():
    previous_day = nan_index - pd.Timedelta(days=1)
    # only start from a NaN row whose previous day has data; skip NaN rows
    # that follow another NaN row. This also handles a leading NaN row.
    if previous_day not in non_nans_df.index:
        continue
    date_offset = 0
    # count how many consecutive all-NaN rows start at this one.
    while (nan_index + pd.Timedelta(days=date_offset)) in nans_df.index:
        date_offset += 1
    # first date after the run of consecutive all-NaN days.
    end_sequence_date = nan_index + pd.Timedelta(days=date_offset)
    # combine the non-NaN previous row with the run of NaN rows that follows it.
    df_to_interpolate = pd.concat([non_nans_df.loc[[previous_day]],
                                   nans_df.loc[nan_index:end_sequence_date]])
    # pull the duration value from the first (non-NaN) row.
    limit_val = int(df_to_interpolate['duration'].iloc[0])
    # interpolate this chunk, filling at most limit_val rows.
    df_to_interpolate = df_to_interpolate.interpolate(axis=0, limit=limit_val, limit_direction='forward')
    # add the chunk to the list that gets combined at the end.
    dfs.append(df_to_interpolate)
# final dataframe, interpolated forward using a dynamic limit based on the most recent duration value.
final_df = pd.concat(dfs)
Here is current df:
ID Date
1 3/29/2017
2
3 11/5/2015
4
5 2/28/2017
I am trying to get year + month as a string in the new column. And this is my code:
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["yyyy_mm"] = df["Year"].map(str) + "-" + df["Month"].map(str)
The issue is when I extract the year and month from the date, it will return the float type.
ID Date Year Month yyyy_mm I hope to get this
1 3/29/2017 2017.0 3.0 2017.0-3.0 2017-3
2 nan-nan
3 11/5/2015 2015.0 11.0 2015.0-11.0 2015-11
4 nan-nan
5 2/28/2017 2017.0 2.0 2017.0-2.0 2017-2
I tried to use df["Date"].dt.year.astype(int) to convert it to int, so that there is no .0, but I got this error: Cannot convert non-finite values (NA or inf) to integer. That is because there are NaN values in the column.
I don't want to fillna the year and month with 0 or something else; I just want to keep them empty, since the date is empty in that row.
You should perform the string conversion directly from Date using pd.Series.dt.strftime.
This not only ensures NaT rows remain NaT, but also gives better-formatted strings, e.g. zero-padded months.
df["yyyy_mm"] = df['Date'].dt.strftime('%Y-%m')
print(df)
ID Date Year Month yyyy_mm
0 1 2017-03-29 2017.0 3.0 2017-03
1 2 NaT NaN NaN NaT
2 3 2015-11-05 2015.0 11.0 2015-11
3 4 NaT NaN NaN NaT
4 5 2017-02-28 2017.0 2.0 2017-02
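A self-contained version of the above, for anyone who wants to try it without the intermediate Year and Month columns. It assumes the dates are month/day/year strings and represents the empty cells as None. In recent pandas versions the missing rows come back from `strftime` as NaN rather than the literal NaT shown above, so the check below uses `isna()` instead of comparing against a string.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Date": ["3/29/2017", None, "11/5/2015", None, "2/28/2017"],
})

# assumption: month/day/year input; missing values become NaT
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# format straight from the datetime column; no float year/month in between
df["yyyy_mm"] = df["Date"].dt.strftime("%Y-%m")

print(df)
print(df["yyyy_mm"].isna().tolist())
```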