I generated a new column holding the 7th business day of each year/month combination in this df:
YearOfSRC MonthNumberOfSRC
0 2022 3
1 2022 4
2 2022 5
3 2022 6
4 2021 4
... ... ...
20528 2022 1
20529 2022 2
20530 2022 3
20531 2022 4
20532 2022 5
With this code:
df['PredictionDate'] = (pd
.to_datetime(df[['YearOfSRC', 'MonthNumberOfSRC']]
.set_axis(['year' ,'month'], axis=1)
.assign(day=1)
)
.sub(pd.offsets.BusinessDay(1))
.add(pd.offsets.BusinessDay(7))
)
To output this dataframe (df_final) with the new column, PredictionDate:
YearOfSRC MonthNumberOfSRC PredictionDate
0 2022 3 2022-03-09
1 2022 4 2022-04-11
2 2022 5 2022-05-10
3 2022 6 2022-06-09
4 2021 4 2021-04-09
... ... ... ...
20528 2022 1 2022-01-11
20529 2022 2 2022-02-09
20530 2022 3 2022-03-09
20531 2022 4 2022-04-11
20532 2022 5 2022-05-10
However, I would like to make use of CustomBusinessDay and Python's holidays package to adjust the rows of PredictionDate where a holiday in the first week would push the 7th business day back by one business day. I know that CustomBusinessDay has a parameter for a holiday list, so in a modular solution I would assign the list from the holidays library to that parameter. I know I could hard-code the extra business day by increasing the day by 1 for every month with a holiday in its first week, but I would prefer a more dynamic solution. I have tried this instead of the above code, but I get a KeyError:
df_final['PredictionDate'] = (pd
.to_datetime(df_final[['YearOfSRC', 'MonthNumberOfSRC']]
.set_axis(['year' ,'month'], axis=1)
.assign(day=1)
)
.sub(df_final.apply(lambda x : pd.offsets.CustomBusinessDay(1, holidays = holidays.US(years = x['YearOfSRC']).keys())))
.add(df_final.apply(lambda x: pd.offsets.CustomBusinessDay(7, holidays = holidays.US(years = x['YearOfSRC']).keys())))
)
KeyError: 'YearOfSRC'
I'm sure I am using pandas apply and lambda functions incorrectly here, but I don't know why the error would be a KeyError when 'YearOfSRC' is clearly a column in df_final.
Per my comment above: DataFrame.apply works column by column by default, so inside the lambda x is an entire column and x['YearOfSRC'] raises the KeyError. Pass axis=1 so the lambda receives one row at a time. Note also that the holidays package takes a years= argument and that CustomBusinessDay expects the holiday dates themselves, i.e. .keys():
df_final['PredictionDate'] = (pd
    .to_datetime(df_final[['YearOfSRC', 'MonthNumberOfSRC']]
                 .set_axis(['year', 'month'], axis=1)
                 .assign(day=1)
                 )
    .sub(df_final.apply(lambda x: pd.offsets.CustomBusinessDay(1, holidays=list(holidays.US(years=x['YearOfSRC']).keys())), axis=1))
    .add(df_final.apply(lambda x: pd.offsets.CustomBusinessDay(7, holidays=list(holidays.US(years=x['YearOfSRC']).keys())), axis=1))
)
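If the per-row apply turns out to be slow on ~20k rows, a possible shortcut (just a sketch, assuming every row can share one US holiday calendar) is to build a single holiday list covering all the years present in the data and reuse one offset:
import pandas as pd
import holidays

# one holiday list covering every year that appears in the data
us_holidays = list(holidays.US(years=df_final['YearOfSRC'].unique().tolist()).keys())

first_bday = pd.offsets.CustomBusinessDay(1, holidays=us_holidays)
seventh_bday = pd.offsets.CustomBusinessDay(7, holidays=us_holidays)

df_final['PredictionDate'] = (pd
    .to_datetime(df_final[['YearOfSRC', 'MonthNumberOfSRC']]
                 .set_axis(['year', 'month'], axis=1)
                 .assign(day=1)
                 )
    .sub(first_bday)
    .add(seventh_bday)
)
pandas may still emit a PerformanceWarning because the offset is applied element-wise, but this avoids constructing a new CustomBusinessDay (and a new holiday calendar) per row.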
I have a table which looks like this:
ID  Start Date  End Date
1   01/01/2022  29/01/2022
2   03/01/2022
3   15/01/2022
4   01/02/2022  01/03/2022
5   01/03/2022  01/05/2022
6   01/04/2022
So, for every row I have the start date of the contract with the user and the end date. If the contract is still active, there is no end date.
I'm trying to get a table that looks like this:
Feb  Mar  Apr  Jun
3    3    4    3
Which counts the number of active users on the first day of the month.
What is the most efficient way to calculate this?
At the moment the only idea that has come to my mind is to use a scaffold table containing the dates I'm interested in (the first day of every month) and from that easily build the new table I need.
But my question is: is there a better way to solve this? I would love to find a more efficient approach, since I would need to repeat the exact same calculation for the number of users at the start of each week.
This might help:
import pandas as pd

# initializing dataframe
df = pd.DataFrame({'start': ['01/01/2022', '03/01/2022', '15/01/2022', '01/02/2022', '01/03/2022', '01/04/2022'],
                   'end': ['29/01/2022', '', '', '01/03/2022', '01/05/2022', '']})

# parse the dates; empty end dates become NaT and are filled with the latest end date
df['start'] = pd.to_datetime(df['start'], format='%d/%m/%Y')
df['end'] = pd.to_datetime(df['end'], format='%d/%m/%Y', errors='coerce')
df['end'] = df['end'].fillna(df['end'].max())

# count the contracts active on the first day of each month
dt_range = pd.date_range(start=df['start'].min(), end=df['end'].max(), freq='MS')
rows = [{'month': dat.strftime('%B - %Y'),
         'number': ((df['start'] <= dat) & (df['end'] >= dat)).sum()}
        for dat in dt_range]
df2 = pd.DataFrame(rows)
Output:
month number
0 January - 2022 1
1 February - 2022 3
2 March - 2022 4
3 April - 2022 4
4 May - 2022 4
Or, if you want the format as in your question:
df2.T
month January - 2022 February - 2022 March - 2022 April - 2022 May - 2022
number 1 3 4 4 4
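A loop-free variant (a sketch built on the same df as above, relying on plain NumPy broadcasting) may scale better if the date grid grows:
import pandas as pd

month_starts = pd.date_range(start=df['start'].min(), end=df['end'].max(), freq='MS')

# compare every contract against every month start in one shot
active = ((df['start'].values[:, None] <= month_starts.values) &
          (df['end'].values[:, None] >= month_starts.values)).sum(axis=0)

counts = pd.Series(active, index=month_starts.strftime('%B - %Y'), name='number')
The same pattern covers the weekly counts mentioned in the question by swapping freq='MS' for a weekly alias such as freq='W-MON'.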
My code is returning the following data from a CSV file:
Quantity Date of purchase
1 17 May 2022 at 5:40:20PM BST
1 2 Apr 2022 at 7:41:29PM BST
1 2 Apr 2022 at 6:42:05PM BST
1 29 Mar 2022 at 12:34:56PM BST
1 29 Mar 2022 at 10:52:54AM BST
1 29 Mar 2022 at 12:04:52AM BST
1 28 Mar 2022 at 4:49:34PM BST
1 28 Mar 2022 at 11:13:37AM BST
1 27 Mar 2022 at 8:53:05PM BST
1 27 Mar 2022 at 5:10:21PM BST
I am trying to keep only the dates and then add up the quantity values that share the same date. Below is the code I have for that:
import pandas
data = pandas.read_csv("products_sold_history_data.csv")
data['Date of purchase'] = pandas.to_datetime(data['Date of purchase'], format='%d-%m-%Y').dt.date
but it's giving me an error. Can anyone please help with how I can take only the dates from the Date of purchase column and then sum the quantity values for each date?
The date format in your data is not the format you specified: format='%d-%m-%Y'.
You could specify it explicitly, or let pandas infer the format for you by not providing one:
pandas.to_datetime(data['Date of purchase']).dt.date
If you want to specify the format explicitly, provide one that matches your data (note %I with %p for 12-hour clock times):
pandas.to_datetime(data['Date of purchase'], format='%d %b %Y at %I:%M:%S%p %Z')
Here is one way to do it, where the date is created as an on-the-fly field rather than added permanently to the DataFrame.
Also, IIUC you're not concerned with the time part; only the date is needed for summing things up.
Extract the date part with a regex, create a temporary field dte using DataFrame.assign, and then group by it to sum up the quantity:
df.assign(dte=pd.to_datetime(
       df['purchase'].str.extract(r'(.*)(at)')[0].str.strip())
   ).groupby('dte')['qty'].sum().reset_index()
dte qty
0 2022-02-06 3
1 2022-02-07 3
2 2022-02-08 2
3 2022-02-09 2
4 2022-02-10 2
5 2022-02-11 3
6 2022-02-14 1
7 2022-02-15 1
8 2022-02-19 1
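With the column names from the question ('Quantity' and 'Date of purchase'), the same idea might look like the sketch below; the regex and the date format are assumptions based on the sample rows shown, not tested against the real file:
import pandas as pd

data = pd.read_csv("products_sold_history_data.csv")

# drop everything from " at " onwards and parse the remainder as a date
purchase_date = pd.to_datetime(
    data['Date of purchase'].str.extract(r'^(.*?)\s+at\s')[0],
    format='%d %b %Y')

daily_totals = (data.assign(purchase_date=purchase_date.dt.date)
                    .groupby('purchase_date')['Quantity']
                    .sum()
                    .reset_index())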
I have created this DataFrame:
agency coupon vintage Cbal Month CPR year month Month_Predicted_DT
0 FHLG 1.5 2021 70.090310 November 5.418937 2022 11 2022-11-01
1 FHLG 1.5 2021 70.090310 December 5.549916 2022 12 2022-12-01
2 FHLG 1.5 2021 70.090310 January 5.238943 2022 1 2022-01-01
3 FHLG 1.5 2020 52.414637 November 5.514456 2022 11 2022-11-01
4 FHLG 1.5 2020 52.414637 December 5.550490 2022 12 2022-12-01
5 FHLG 1.5 2020 52.414637 January 5.182304 2022 1 2022-01-01
Created from this original df:
agency coupon year Cbal November December January
0 FHLG 1.5 2021 70.090310 5.418937 5.549916 5.238943
1 FHLG 1.5 2020 52.414637 5.514456 5.550490 5.182304
2 FHLG 2.0 2022 44.598755 3.346706 3.715995 3.902644
3 FHLG 2.0 2021 472.209165 5.802857 5.899596 5.627774
4 FHLG 2.0 2020 269.761452 7.090993 7.091404 6.567561
Using this code:
import pandas as pd
from datetime import date

citi = pd.read_excel("Downloads/CITI_2022_05_22(5_22).xlsx")

#Extracting just the relevant months (M, M+1, M+2)
M = citi.columns[-6]
M_1 = citi.columns[-4]
M_2 = citi.columns[-2]

#Extracting just the relevant columns
cols = ['agency-term','coupon','year','Cbal',M,M_1,M_2]
citi = citi[cols]

#Reshaping so each month column becomes its own row
citi_new = citi.set_index(cols[0:4]).stack().reset_index()
citi_new.rename(columns={"level_4": "Month", 0: "CPR", "year": "vintage"}, inplace=True)

#Building Month_Predicted_DT from the current year and the month name
todays_date = date.today()
current_year = todays_date.year
citi_new['year'] = current_year
citi_new['month'] = pd.to_datetime(citi_new.Month, format="%B").dt.month
citi_new['Month_Predicted_DT'] = pd.to_datetime(citi_new[['year', 'month']].assign(DAY=1))
For reference M is the current month, and M_1 and M_2 are month+1 and month+2.
My main question is that my solution for creating the 'Month_Predicted_DT' column only works if the months in question do not cross into the new year: if M is November or December, then the year in Month_Predicted_DT is wrong for January and/or February. For example, Month_Predicted_DT for January rows should be 2023-01-01, not 2022-01-01. The same would be true if M were December; then I would want the rows for January and February to be 2023-01-01 and 2023-02-01, respectively.
I have tried to come up with a workaround using df.iterrows or np.where but just can't really get a working solution.
You could try adding 12 months to any date that lands before the current month (i.e. a month name that has wrapped past the new year):
#get first day of the current month
start = pd.Timestamp.today().normalize().replace(day=1)
#convert month column to timestamps
dates = pd.to_datetime(df["Month"]+f"{start.year}", format="%B%Y")
#offset the year if the date is not in the next 3 months
df["Month_Predicted_DT"] = dates.where(dates>=start,dates+pd.DateOffset(months=12))
I have a dataframe df1:
Month
1
3
March
April
2
4
5
I have another dataframe df2:
Month Name
1 January
2 February
3 March
4 April
5 May
If I want to replace the integer values of df1 with the corresponding name from df2, what kind of lookup function can I use?
I want to end up with this as my df1:
Month
January
March
March
April
February
April
May
Use replace, building the mapping from df2 (cast the keys to str so they match df1's string values):
df1.replace(dict(zip(df2.Month.astype(str), df2.Name)))
Out[76]:
Month
0 January
1 March
2 March
3 April
4 February
5 April
6 May
You can use pd.Series.map and then fillna. Just be careful to map either strings to strings or, as here, numeric to numeric:
month_name = df2.set_index('Month')['Name']
df1['Month'] = pd.to_numeric(df1['Month'], errors='coerce').map(month_name)\
                 .fillna(df1['Month'])
print(df1)
Month
0 January
1 March
2 March
3 April
4 February
5 April
6 May
You can also use pd.Series.replace, but this is often inefficient.
One alternative is to use map with a function:
def repl(x, lookup=dict(zip(df2.Month.astype(str), df2.Name))):
return lookup.get(x, x)
df['Month'] = df['Month'].map(repl)
print(df)
Output
Month
0 January
1 February
2 March
3 April
4 May
Use map with a series, just need to make sure your dtypes match:
mapper = df2.set_index(df2['Month'].astype(str))['Name']
df1['Month'].map(mapper).fillna(df1['Month'])
Output:
0 January
1 March
2 March
3 April
4 February
5 April
6 May
Name: Month, dtype: object
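If the integer-like values are always calendar month numbers, a sketch that skips the df2 lookup altogether (assuming standard English month names are what you want) is:
import calendar
import pandas as pd

# numeric where possible, NaN for values that are already names
nums = pd.to_numeric(df1['Month'], errors='coerce')

df1['Month'] = nums.map(lambda n: calendar.month_name[int(n)] if pd.notna(n) else None)\
                   .fillna(df1['Month'])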
I am still quite new to Python, so please excuse my basic question.
After a reset of pandas grouped dataframe, I get the following:
year month pl
0 2010 1 27.4376
1 2010 2 29.2314
2 2010 3 33.5714
3 2010 4 37.2986
4 2010 5 36.6971
5 2010 6 35.9329
I would like to merge year and month to one column in pandas datetime format.
I am trying:
C3['date']=pandas.to_datetime(C3.year + C3.month, format='%Y-%m')
But it gives me a date like this:
year month pl date
0 2010 1 27.4376 1970-01-01 00:00:00.000002011
What is the correct way? Thank you.
You need to convert to str if necessary, then zfill the month col and pass this with a valid format to to_datetime:
In [303]:
df['date'] = pd.to_datetime(df['year'].astype(str) + df['month'].astype(str).str.zfill(2), format='%Y%m')
df
Out[303]:
year month pl date
0 2010 1 27.4376 2010-01-01
1 2010 2 29.2314 2010-02-01
2 2010 3 33.5714 2010-03-01
3 2010 4 37.2986 2010-04-01
4 2010 5 36.6971 2010-05-01
5 2010 6 35.9329 2010-06-01
If the conversion is unnecessary then the following should work:
df['date'] = pd.to_datetime(df['year'] + df['month'].str.zfill(2), format='%Y%m')
Your attempt failed as it treated the value as epoch time:
In [305]:
pd.to_datetime(20101, format='%Y-%m')
Out[305]:
Timestamp('1970-01-01 00:00:00.000020101')
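As a side note, a sketch of another option when year and month are already numeric columns is to hand them straight to to_datetime together with a constant day, which skips the string concatenation and zfill entirely:
import pandas as pd

df = pd.DataFrame({'year': [2010] * 6, 'month': [1, 2, 3, 4, 5, 6],
                   'pl': [27.4376, 29.2314, 33.5714, 37.2986, 36.6971, 35.9329]})

# to_datetime accepts a DataFrame with year/month/day columns
df['date'] = pd.to_datetime(df[['year', 'month']].assign(day=1))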