I have a dataframe df1:
Month
1
3
March
April
2
4
5
I have another dataframe df2:
Month Name
1 January
2 February
3 March
4 April
5 May
If I want to replace the integer values of df1 with the corresponding name from df2, what kind of lookup function can I use?
I want to end up with this as my df1:
Month
January
March
March
April
February
April
May
Use replace with a dictionary built from df2:
df1.replace(dict(zip(df2.Month.astype(str),df2.Name)))
Out[76]:
Month
0 January
1 March
2 March
3 April
4 February
5 April
6 May
You can use pd.Series.map and then fillna. Just be careful to map either strings to strings or, as here, numeric to numeric:
month_name = df2.set_index('Month')['Name']
df1['Month'] = pd.to_numeric(df1['Month'], errors='coerce').map(month_name)\
.fillna(df1['Month'])
print(df1)
Month
0 January
1 March
2 March
3 April
4 February
5 April
6 May
You can also use pd.Series.replace, but this is often inefficient.
One alternative is to use map with a function:
def repl(x, lookup=dict(zip(df2.Month.astype(str), df2.Name))):
    return lookup.get(x, x)
df1['Month'] = df1['Month'].map(repl)
print(df1)
Output
Month
0 January
1 March
2 March
3 April
4 February
5 April
6 May
Use map with a Series; just make sure the dtypes match:
mapper = df2.set_index(df2['Month'].astype(str))['Name']
df1['Month'].map(mapper).fillna(df1['Month'])
Output:
0 January
1 March
2 March
3 April
4 February
5 April
6 May
Name: Month, dtype: object
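For reference, here is a minimal self-contained sketch of the map-then-fillna approach from the answers above. The sample frames are rebuilt by hand, and it assumes df1['Month'] holds strings, so the lookup keys are cast to strings as well.

```python
import pandas as pd

# Sample data rebuilt from the question; Month is assumed to be a string column.
df1 = pd.DataFrame({"Month": ["1", "3", "March", "April", "2", "4", "5"]})
df2 = pd.DataFrame({"Month": [1, 2, 3, 4, 5],
                    "Name": ["January", "February", "March", "April", "May"]})

# Lookup Series indexed by the month number as a string, so the dtypes match df1.
lookup = df2.set_index(df2["Month"].astype(str))["Name"]

# map() yields NaN for values without a match; fillna() restores the originals.
df1["Month"] = df1["Month"].map(lookup).fillna(df1["Month"])
print(df1["Month"].tolist())
```

If the column held real integers mixed with strings, the keys would have to be normalised first (for example with astype(str)) before the lookup.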
Related
I have a table which looks like this:
ID  Start Date  End Date
1   01/01/2022  29/01/2022
2   03/01/2022
3   15/01/2022
4   01/02/2022  01/03/2022
5   01/03/2022  01/05/2022
6   01/04/2022
So, for every row I have the start date of the contract with the user and the end date. If the contract is still active, there is no end date.
I'm trying to get a table that looks like this:
Feb  Mar  Apr  Jun
3    3    4    3
It counts the number of active users on the first day of each month.
What is the most efficient way to calculate this?
At the moment, the only idea that comes to mind is to use a scaffold table containing the dates I'm interested in (the first day of every month) and build the new table from that.
Is there a better way to solve this? I would love to find a more efficient approach, since I need to repeat the exact same calculation for the number of users at the start of each week.
This might help:
import pandas as pd
# initializing dataframe
df = pd.DataFrame({'start':['01/01/2022','03/01/2022','15/01/2022','01/02/2022','01/03/2022','01/04/2022'],
                   'end':['29/01/2022','','','01/03/2022','01/05/2022','']})
# cleaning datetime (the empty ones are replaced with the max exit)
df['start'] = pd.to_datetime(df['start'], format='%d/%m/%Y')
df['end'] = pd.to_datetime(df['end'], format='%d/%m/%Y', errors='coerce')
df['end'] = df['end'].fillna(df['end'].max())
# first day of every month between the earliest start and the latest end
dt_range = pd.date_range(start=df.start.min(), end=df.end.max(), freq='MS')
# count the contracts active on each of those days
# (DataFrame.append was removed in pandas 2.0, so the rows are collected in a list)
rows = [{'month': dat.strftime('%B - %Y'),
         'number': len(df[(df.start <= dat) & (df.end >= dat)])} for dat in dt_range]
df2 = pd.DataFrame(rows, columns=['month', 'number'])
Output:
month number
0 January - 2022 1
1 February - 2022 3
2 March - 2022 4
3 April - 2022 4
4 May - 2022 4
Or, if you want the format as in your question:
df2.T
month January - 2022 February - 2022 March - 2022 April - 2022 May - 2022
number 1 3 4 4 4
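Since the question also asks about weekly counts, the loop above could be replaced by a broadcast comparison wrapped in a small helper. This is only a sketch: count_active and its freq argument are hypothetical names, and open contracts are assumed to run until the latest date seen in the data, as in the answer above.

```python
import pandas as pd

def count_active(contracts, freq="MS"):
    """Count contracts active on each date of a regular grid (hypothetical helper)."""
    # Latest date seen in the data; open-ended contracts are assumed to run until then.
    last = max(contracts["start"].max(), contracts["end"].max())
    start = contracts["start"].to_numpy()
    end = contracts["end"].fillna(last).to_numpy()
    days = pd.date_range(contracts["start"].min(), last, freq=freq)
    # Compare every grid date (rows) against every contract (columns) at once.
    on_day = days.to_numpy()[:, None]
    active = (start <= on_day) & (end >= on_day)
    return pd.Series(active.sum(axis=1), index=days, name="active")

contracts = pd.DataFrame({
    "start": pd.to_datetime(["01/01/2022", "03/01/2022", "15/01/2022",
                             "01/02/2022", "01/03/2022", "01/04/2022"], format="%d/%m/%Y"),
    "end": pd.to_datetime(["29/01/2022", None, None,
                           "01/03/2022", "01/05/2022", None], format="%d/%m/%Y"),
})
monthly = count_active(contracts, freq="MS")
print(monthly)
```

Passing freq="W-MON" instead would give the count on every Monday, so the same helper covers the start-of-week case. The intermediate boolean array has one cell per (date, contract) pair, which may be too large for very long histories.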
I generated a new column displaying the 7th business day of the year and month from this df:
YearOfSRC MonthNumberOfSRC
0 2022 3
1 2022 4
2 2022 5
3 2022 6
4 2021 4
... ... ... ...
20528 2022 1
20529 2022 2
20530 2022 3
20531 2022 4
20532 2022 5
With this code:
df['PredictionDate'] = (pd
.to_datetime(df[['YearOfSRC', 'MonthNumberOfSRC']]
.set_axis(['year' ,'month'], axis=1)
.assign(day=1)
)
.sub(pd.offsets.BusinessDay(1))
.add(pd.offsets.BusinessDay(7))
)
To output this dataframe (df_final) with the new column, PredictionDate:
YearOfSRC MonthNumberOfSRC PredictionDate
0 2022 3 2022-03-09
1 2022 4 2022-04-11
2 2022 5 2022-05-10
3 2022 6 2022-06-09
4 2021 4 2021-04-09
... ... ... ...
20528 2022 1 2022-01-11
20529 2022 2 2022-02-09
20530 2022 3 2022-03-09
20531 2022 4 2022-04-11
20532 2022 5 2022-05-10
(More details here)
However, I would like to use CustomBusinessDay and Python's holidays package to adjust the rows of PredictionDate where a holiday in the first week pushes the 7th business day back by one business day. CustomBusinessDay has a holidays parameter, so in a modular solution I would pass it the list from the holidays library. I could hard-code the extra business day by adding 1 to the day for every month with a holiday in the first week, but I would prefer a more dynamic solution. I tried the following instead of the code above, but I get a KeyError:
df_final['PredictionDate'] = (pd
.to_datetime(df_final[['YearOfSRC', 'MonthNumberOfSRC']]
.set_axis(['year' ,'month'], axis=1)
.assign(day=1)
)
.sub(df_final.apply(lambda x : pd.offsets.CustomBusinessDay(1, holidays = holidays.US(years = x['YearOfSRC']).keys())))
.add(df_final.apply(lambda x: pd.offsets.CustomBusinessDay(7, holidays = holidays.US(years = x['YearOfSRC']).keys())))
)
KeyError: 'YearOfSRC'
I'm sure I am implementing pandas apply and lambda functions incorrectly here, but I don't know why the error would be a key error when that's clearly a column in df_final.
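Per my comment above, pass axis=1 to apply so that the lambda receives one row at a time. With the default axis=0 it receives each column as a Series, so x['YearOfSRC'] looks for a row label named 'YearOfSRC' and raises the KeyError. Try this:
df_final['PredictionDate'] = (pd
.to_datetime(df_final[['YearOfSRC', 'MonthNumberOfSRC']]
.set_axis(['year' ,'month'], axis=1)
.assign(day=1)
)
.sub(df_final.apply(lambda x : pd.offsets.CustomBusinessDay(1, holidays = list(holidays.US(years = int(x['YearOfSRC'])).keys())), axis=1))
.add(df_final.apply(lambda x: pd.offsets.CustomBusinessDay(7, holidays = list(holidays.US(years = int(x['YearOfSRC'])).keys())), axis=1))
)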
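As a rough illustration of the effect, the sketch below uses a single CustomBusinessDay offset built from one holiday list that covers every year in the data, instead of one offset per row. The sample rows and the hand-written holiday list are assumptions made for the example; with the holidays package the list would come from something like holidays.US(years=...).

```python
import pandas as pd
from pandas.tseries.offsets import CustomBusinessDay

# Hypothetical sample rows; the column names follow the question.
df_final = pd.DataFrame({"YearOfSRC": [2022, 2022], "MonthNumberOfSRC": [3, 7]})

# Holiday dates written out by hand for illustration (Independence Day 2022).
holiday_dates = ["2022-07-04"]

one_day = CustomBusinessDay(1, holidays=holiday_dates)
seven_days = CustomBusinessDay(7, holidays=holiday_dates)

# First calendar day of each (year, month) pair.
first_of_month = pd.to_datetime(
    df_final.rename(columns={"YearOfSRC": "year", "MonthNumberOfSRC": "month"})
            .assign(day=1)
)

# Step back one business day, then forward seven, skipping weekends and holidays.
df_final["PredictionDate"] = first_of_month.map(lambda d: d - one_day + seven_days)
print(df_final)
```

March 2022 has no holiday in its first week and still lands on 2022-03-09, while July 2022 moves from 2022-07-11 to 2022-07-12 because 4 July is skipped.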
I have created this DataFrame:
agency coupon vintage Cbal Month CPR year month Month_Predicted_DT
0 FHLG 1.5 2021 70.090310 November 5.418937 2022 11 2022-11-01
1 FHLG 1.5 2021 70.090310 December 5.549916 2022 12 2022-12-01
2 FHLG 1.5 2021 70.090310 January 5.238943 2022 1 2022-01-01
3 FHLG 1.5 2020 52.414637 November 5.514456 2022 11 2022-11-01
4 FHLG 1.5 2020 52.414637 December 5.550490 2022 12 2022-12-01
5 FHLG 1.5 2020 52.414637 January 5.182304 2022 1 2022-01-01
Created from this original df:
agency coupon year Cbal November December January
0 FHLG 1.5 2021 70.090310 5.418937 5.549916 5.238943
1 FHLG 1.5 2020 52.414637 5.514456 5.550490 5.182304
2 FHLG 2.0 2022 44.598755 3.346706 3.715995 3.902644
3 FHLG 2.0 2021 472.209165 5.802857 5.899596 5.627774
4 FHLG 2.0 2020 269.761452 7.090993 7.091404 6.567561
Using this code:
import pandas as pd
from datetime import date
citi = pd.read_excel("Downloads/CITI_2022_05_22(5_22).xlsx")
#Extracting just the relevant months (M, M+1, M+2)
M = citi.columns[-6]
M_1 = citi.columns[-4]
M_2 = citi.columns[-2]
#Extracting just the relevant columns
cols = ['agency-term','coupon','year','Cbal',M,M_1,M_2]
citi = citi[cols]
#Reshaping to long format: one row per month column
citi_new = citi.set_index(cols[0:4]).stack().reset_index()
citi_new.rename(columns={"level_4": "Month", 0 : "CPR", "year" : "vintage"}, inplace = True)
#Building the prediction date from the current year and the month name
todays_date = date.today()
current_year = todays_date.year
citi_new['year'] = current_year
citi_new['month'] = pd.to_datetime(citi_new.Month, format="%B").dt.month
citi_new['Month_Predicted_DT'] = pd.to_datetime(citi_new[['year', 'month']].assign(DAY=1))
For reference M is the current month, and M_1 and M_2 are month+1 and month+2.
My main problem is that my approach to creating the 'Month_Predicted_DT' column only works if the months in question do not cross into the new year. If M == November or M == December, the year in Month_Predicted_DT is wrong for January and/or February. For example, Month_Predicted_DT for the January rows should be 2023-01-01, not 2022-01-01. Likewise, if M were December, I would want the January and February rows to be 2023-01-01 and 2023-02-01, respectively.
I have tried to come up with a workaround using df.iterrows or np.where but just can't really get a working solution.
You could add 12 months to any date that falls before the first day of the current month:
#get first day of the current month
start = pd.Timestamp.today().normalize().replace(day=1)
#convert month column to timestamps
dates = pd.to_datetime(df["Month"]+f"{start.year}", format="%B%Y")
#push the date into next year if it falls before the current month
df["Month_Predicted_DT"] = dates.where(dates>=start,dates+pd.DateOffset(months=12))
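To check the rollover without depending on today's date, the same idea can be wrapped in a function that takes the reference date as an argument. This is a sketch only; predicted_dates and its today parameter are hypothetical names, and the month names are assumed to be in English.

```python
import pandas as pd

def predicted_dates(months, today):
    """Return the first day of each named month, on or after the month of `today`."""
    start = pd.Timestamp(today).normalize().replace(day=1)
    # Parse each month name as if it belonged to the current year.
    dates = pd.to_datetime(months + f" {start.year}", format="%B %Y")
    # Months that are already behind the current month belong to next year.
    return dates.where(dates >= start, dates + pd.DateOffset(years=1))

months = pd.Series(["November", "December", "January"])
result = predicted_dates(months, "2022-11-15")
print(result)
```

With a reference date in November 2022, the January row moves to 2023-01-01 while November and December stay in 2022.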
The inputs are a start year, start month, end year, and end month (for example, May 2022 to June 2024). How can I calculate in Python how many times a specific month (such as January, March, or December) occurs in this period?
Use date_range with DatetimeIndex.month_name and Index.value_counts:
s = pd.date_range('2022-05-01','2024-06-01', freq='MS').month_name().value_counts()
print (s)
June 3
May 3
April 2
March 2
July 2
December 2
November 2
October 2
February 2
January 2
September 2
August 2
dtype: int64
Finally, select by index label from the Series s:
print (s['January'])
2
print (s['March'])
2
pandas date_range is a good helper here
import pandas as pd
ym = pd.date_range('2022-05-01','2024-06-01', freq='MS').strftime("%Y-%b").to_list()
print(ym)
def count_month(ym_list, month):
    return sum(month in s for s in ym_list)
print(count_month(ym, "May"))
print(count_month(ym, "Jan"))
and the output is
['2022-May', '2022-Jun', '2022-Jul', '2022-Aug', '2022-Sep', '2022-Oct', '2022-Nov', '2022-Dec', '2023-Jan', '2023-Feb', '2023-Mar', '2023-Apr', '2023-May', '2023-Jun', '2023-Jul', '2023-Aug', '2023-Sep', '2023-Oct', '2023-Nov', '2023-Dec', '2024-Jan', '2024-Feb', '2024-Mar', '2024-Apr', '2024-May', '2024-Jun']
3
2
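If pandas is not otherwise needed, the same count can be sketched with the standard library alone, taking the four integers from the question directly as inputs. The function name month_counts is hypothetical.

```python
from collections import Counter
import calendar

def month_counts(start_year, start_month, end_year, end_month):
    """Count how often each month name occurs in an inclusive year-month range."""
    # Turn each (year, month) pair into a running month index.
    first = start_year * 12 + (start_month - 1)
    last = end_year * 12 + (end_month - 1)
    # index % 12 recovers the zero-based month; calendar.month_name is one-based.
    return Counter(calendar.month_name[i % 12 + 1] for i in range(first, last + 1))

counts = month_counts(2022, 5, 2024, 6)
print(counts["May"], counts["January"])
```

For May 2022 to June 2024 this gives 3 for May and June and 2 for every other month, matching the value_counts result above.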
I have the following dataframe :
month price
0 April 102.478015
1 August 94.868053
2 December 97.278205
3 February 100.114510
4 January 99.419109
5 July 93.402928
6 June 96.114224
7 March 101.297762
8 May 102.905340
9 November 97.952169
10 October 95.606478
11 September 94.226803
I would like to have the months in calendar order (January in the first row through December in the 12th row). How could I do this?
If necessary, you can copy this dataframe and then execute
pd.read_clipboard(sep=r'\s\s+')
to load the dataframe into your Jupyter notebook.
Convert the values to an ordered categorical, which makes it possible to use DataFrame.sort_values:
cats = ['January','February','March','April','May','June',
'July','August','September','October','November','December']
df['month'] = pd.CategoricalIndex(df['month'], ordered=True, categories=cats)
#alternative
#df['month'] = pd.Categorical(df['month'], ordered=True, categories=cats)
df = df.sort_values('month')
print (df)
month price
4 January 99.419109
3 February 100.114510
7 March 101.297762
0 April 102.478015
8 May 102.905340
6 June 96.114224
5 July 93.402928
1 August 94.868053
11 September 94.226803
10 October 95.606478
9 November 97.952169
2 December 97.278205
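If the column should stay as plain strings, one alternative is to sort with a key that converts each month name to its number. This is a sketch on a shortened copy of the data, assuming pandas 1.1 or later (for the key argument) and English month names.

```python
import pandas as pd

# Shortened copy of the question's data.
df = pd.DataFrame({"month": ["April", "August", "December", "February", "January"],
                   "price": [102.478015, 94.868053, 97.278205, 100.114510, 99.419109]})

# The key function receives the whole column and returns the month numbers to sort by.
ordered = df.sort_values("month", key=lambda s: pd.to_datetime(s, format="%B").dt.month)
print(ordered)
```

The categorical approach above is the better fit when the ordering should persist through later operations such as groupby, while the key-based sort leaves the dtype untouched.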