I have a table which looks like this:
ID  Start Date  End Date
1   01/01/2022  29/01/2022
2   03/01/2022
3   15/01/2022
4   01/02/2022  01/03/2022
5   01/03/2022  01/05/2022
6   01/04/2022
So, for every row I have the start date of the contract with the user and the end date. If the contract is still active, there is no end date.
I'm trying to get a table that looks like this:
Feb  Mar  Apr  Jun
3    3    4    3
Each value is the number of active users on the first day of that month.
What is the most efficient way to calculate this?
At the moment the only idea that comes to mind is to use a scaffold table containing the dates I'm interested in (the first day of every month) and build the new table from that.
But is there a better way to solve this? I would love to find a more efficient approach, since I need to repeat the exact same calculation for the number of users at the start of each week.
This might help:
import pandas as pd

# initializing dataframe
df = pd.DataFrame({'start':['01/01/2022','03/01/2022','15/01/2022','01/02/2022','01/03/2022','01/04/2022'],
                   'end':['29/01/2022','','','01/03/2022','01/05/2022','']})
# cleaning datetime (the empty ones are replaced with the max exit)
df['start'] = pd.to_datetime(df['start'],format='%d/%m/%Y')
df['end'] = pd.to_datetime(df['end'],format='%d/%m/%Y', errors='coerce')
df['end'] = df['end'].fillna(df['end'].max())
# first day of every month between the earliest start and the latest end
dt_range = pd.date_range(start=df.start.min(),end=df.end.max(),freq='MS')
# count the contracts active on each of those days
# (DataFrame.append was removed in pandas 2.0, so the rows are collected in a list)
rows = []
for dat in dt_range:
    rows.append({'month':dat.strftime('%B - %Y'),'number':len(df[(df.start <= dat)&(df.end >= dat)])})
df2 = pd.DataFrame(rows, columns=['month','number'])
Output:
month number
0 January - 2022 1
1 February - 2022 3
2 March - 2022 4
3 April - 2022 4
4 May - 2022 4
Or, if you want the format as in your question:
df2.T
month January - 2022 February - 2022 March - 2022 April - 2022 May - 2022
number 1 3 4 4 4
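Since the question also asks about repeating the calculation per week, here is a minimal sketch of the same count without a Python loop. The function name active_counts and its freq parameter are made up for illustration, and it assumes that a missing end date means the contract is still active.

```python
import pandas as pd

def active_counts(df: pd.DataFrame, freq: str = "MS") -> pd.Series:
    """Count rows whose start/end interval covers each date in the range.

    Hypothetical helper: assumes columns 'start' and 'end' are datetimes
    and that a missing 'end' (NaT) means the contract is still active.
    """
    dates = pd.date_range(df["start"].min(), df["end"].max(), freq=freq)
    grid = dates.to_numpy()
    # Column vectors, so each contract is compared against every date at once
    starts = df["start"].to_numpy()[:, None]
    ends = df["end"].to_numpy()[:, None]
    open_ended = df["end"].isna().to_numpy()[:, None]
    active = (starts <= grid) & (open_ended | (ends >= grid))
    return pd.Series(active.sum(axis=0), index=dates, name="number")

df = pd.DataFrame({
    "start": pd.to_datetime(
        ["01/01/2022", "03/01/2022", "15/01/2022",
         "01/02/2022", "01/03/2022", "01/04/2022"], format="%d/%m/%Y"),
    "end": pd.to_datetime(
        ["29/01/2022", None, None, "01/03/2022", "01/05/2022", None],
        format="%d/%m/%Y"),
})

monthly = active_counts(df, freq="MS")     # first day of every month
weekly = active_counts(df, freq="W-MON")   # every Monday
print(monthly)
```

Only the freq argument changes between the monthly and weekly versions. The comparison builds a contracts-by-dates boolean grid, so memory grows with the product of the two sizes.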
I have created this DataFrame:
agency coupon vintage Cbal Month CPR year month Month_Predicted_DT
0 FHLG 1.5 2021 70.090310 November 5.418937 2022 11 2022-11-01
1 FHLG 1.5 2021 70.090310 December 5.549916 2022 12 2022-12-01
2 FHLG 1.5 2021 70.090310 January 5.238943 2022 1 2022-01-01
3 FHLG 1.5 2020 52.414637 November 5.514456 2022 11 2022-11-01
4 FHLG 1.5 2020 52.414637 December 5.550490 2022 12 2022-12-01
5 FHLG 1.5 2020 52.414637 January 5.182304 2022 1 2022-01-01
Created from this original df:
agency coupon year Cbal November December January
0 FHLG 1.5 2021 70.090310 5.418937 5.549916 5.238943
1 FHLG 1.5 2020 52.414637 5.514456 5.550490 5.182304
2 FHLG 2.0 2022 44.598755 3.346706 3.715995 3.902644
3 FHLG 2.0 2021 472.209165 5.802857 5.899596 5.627774
4 FHLG 2.0 2020 269.761452 7.090993 7.091404 6.567561
Using this code:
from datetime import date
import pandas as pd

citi = pd.read_excel("Downloads/CITI_2022_05_22(5_22).xlsx")
#Extracting just the relevant months (M, M+1, M+2)
M = citi.columns[-6]
M_1 = citi.columns[-4]
M_2 = citi.columns[-2]
#Extracting just the relevant columns
cols = ['agency-term','coupon','year','Cbal',M,M_1,M_2]
citi = citi[cols]
#Reshaping to one row per month (citi_new must exist before the columns below are added)
citi_new = citi.set_index(cols[0:4]).stack().reset_index()
citi_new.rename(columns={"level_4": "Month", 0 : "CPR", "year" : "vintage"}, inplace = True)
#Building the date column from the current year and the month name
todays_date = date.today()
current_year = todays_date.year
citi_new['year'] = current_year
citi_new['month'] = pd.to_datetime(citi_new.Month, format="%B").dt.month
citi_new['Month_Predicted_DT'] = pd.to_datetime(citi_new[['year', 'month']].assign(DAY=1))
For reference M is the current month, and M_1 and M_2 are month+1 and month+2.
My main question is that my solution for creating the 'Month_Predicted_DT' column only works if the months in question do not cross into the new year. If M == November or M == December, the year in Month_Predicted_DT is not correct for January and/or February. For example, Month_Predicted_DT for the January rows above should be 2023-01-01, not 2022-01-01. Likewise, if M was December, I would want the rows for January and February to be 2023-01-01 and 2023-02-01, respectively.
I have tried to come up with a workaround using df.iterrows or np.where but just can't really get a working solution.
You could try adding 12 months to any date that falls before the start of the current month:
#get first day of the current month
start = pd.Timestamp.today().normalize().replace(day=1)
#convert month column to timestamps
dates = pd.to_datetime(df["Month"]+f"{start.year}", format="%B%Y")
#offset the year if the date is not in the next 3 months
df["Month_Predicted_DT"] = dates.where(dates>=start,dates+pd.DateOffset(months=12))
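To check the rollover without waiting for November, the same logic can be wrapped in a function that takes the reference date as an argument. This is only a sketch: the function name predicted_dates and the fixed date 2022-11-15 are made up for illustration.

```python
import pandas as pd

def predicted_dates(months: pd.Series, today: pd.Timestamp) -> pd.Series:
    """Turn month names into first-of-month dates on or after the current month.

    Hypothetical helper; 'today' is passed in so the result is reproducible.
    """
    # first day of the current month
    start = today.normalize().replace(day=1)
    # month name + current year, e.g. "January2022"
    dates = pd.to_datetime(months + str(start.year), format="%B%Y")
    # anything before the current month belongs to next year
    return dates.where(dates >= start, dates + pd.DateOffset(months=12))

months = pd.Series(["November", "December", "January"])
result = predicted_dates(months, pd.Timestamp("2022-11-15"))
print(result)
```

With a reference date in November 2022, the November and December rows keep the year 2022 and the January row moves to 2023.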
So I am really new to this and struggling with something, which I feel should be quite simple.
I have a Pandas Dataframe containing two columns: Fiscal Week (str) and Amount sold (int).
   Fiscal Week  Amount sold
0  2019031      24
1  2019041      47
2  2019221      34
3  2019231      46
4  2019241      35
My problem is the Fiscal Week column. It contains strings that describe the fiscal year and week. The fiscal year for this purpose starts on October 1st and ends on September 30th. So 2019031 is the Monday (the 1 at the end) of the third week of October 2019, and 2019221 would be the 2nd week of March 2020.
The issue is that I want to turn this data into timeseries later. But I can't do that with the data in string format - I need it to be in date time format.
I actually added the 1s at the end of all these strings using
df['Fiscal Week']= df['Fiscal Week'].map('{}1'.format)
so that I can then turn it into a proper date:
df['Fiscal Week'] = pd.to_datetime(df['Fiscal Week'], format="%Y%W%w")
as I couldn't figure out how to do it with just the weeks and no day defined.
This, of course, returns the following:
   Fiscal Week  Amount sold
0  2019-01-21   24
1  2019-01-28   47
2  2019-06-03   34
3  2019-06-10   46
4  2019-06-17   35
As expected, this is clearly not what I need, as according to the definition of the fiscal year week 1 is not January at all but rather October.
Is there some simple solution to get the dates to what they are actually supposed to be?
Ideally I would like the final format to be e.g. 2019-03 for the first entry. So basically exactly like the string but in some kind of date format, that I can then work with later on. Alternatively, calendar weeks would also be fine.
Assuming you have a data frame with fiscal dates of the form 'YYYYWW', where YYYY = the calendar year in which the fiscal year starts and WW = the number of weeks into the fiscal year, you can convert to calendar dates as follows:
import pandas as pd

def getCalendarDate(fy_date: str):
    f_year = fy_date[0:4]
    f_week = fy_date[4:]
    # the fiscal year starts on October 1st
    fys = pd.to_datetime(f'{f_year}/10/01', format= '%Y/%m/%d')
    return fys + pd.to_timedelta(int(f_week), "W")
You can then use this function to create the column of calendar dates as follows:
df['Calendar Date'] = list(getCalendarDate(x) for x in df['Fiscal Week'].to_list())
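The same conversion can also be done on the whole column at once. The sketch below is an assumption-laden variant, not the answer's own code: it reads only characters 5 and 6 as the week number, so it also accepts the question's strings with the trailing day digit, and it keeps the same "October 1st plus N weeks" rule.

```python
import pandas as pd

df = pd.DataFrame({
    "Fiscal Week": ["2019031", "2019041", "2019221"],
    "Amount sold": [24, 47, 34],
})

# Assumption: characters 0-3 are the fiscal start year, 4-5 the week number;
# any trailing day digit is ignored.
years = df["Fiscal Week"].str[:4]
weeks = df["Fiscal Week"].str[4:6].astype(int)

# October 1st of the fiscal start year, then shift by the number of weeks
fiscal_start = pd.to_datetime(years + "-10-01", format="%Y-%m-%d")
df["Calendar Date"] = fiscal_start + pd.to_timedelta(weeks * 7, unit="D")
print(df)
```

Week 22 of fiscal year 2019 lands in March 2020 here, which agrees with the example given in the question.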
The inputs are a start year, start month, end year, and end month (for example, May 2022 to June 2024). I need to calculate how many times each specific month occurs in this period (for example, how many Januaries, Marches, or Decembers it contains). How can I achieve this using Python?
Use date_range with DatetimeIndex.month_name and Index.value_counts:
s = pd.date_range('2022-05-01','2024-06-01', freq='MS').month_name().value_counts()
print (s)
June 3
May 3
April 2
March 2
July 2
December 2
November 2
October 2
February 2
January 2
September 2
August 2
dtype: int64
Finally, select a month by its index label in the Series s:
print (s['January'])
2
print (s['March'])
2
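Because the question gives the period as four separate numbers, the date_range call above can be wrapped so it takes them directly. A minimal sketch follows; the function name month_counts is made up for illustration.

```python
import pandas as pd

def month_counts(start_year: int, start_month: int,
                 end_year: int, end_month: int) -> pd.Series:
    """Count how often each month name occurs in the inclusive period."""
    start = pd.Timestamp(year=start_year, month=start_month, day=1)
    end = pd.Timestamp(year=end_year, month=end_month, day=1)
    # one entry per month start, both endpoints included
    return pd.date_range(start, end, freq="MS").month_name().value_counts()

counts = month_counts(2022, 5, 2024, 6)
print(counts["May"], counts["January"])
```

A month that never occurs in the period is simply absent from the result, so counts.get("January", 0) is the safer lookup for short periods.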
pandas date_range is a good helper here
import pandas as pd
ym = pd.date_range('2022-05-01','2024-06-01', freq='MS').strftime("%Y-%b").to_list()
print(ym)
def count_month(ym_list, month):
    return(sum(month in s for s in ym_list))
print(count_month(ym, "May"))
print(count_month(ym, "Jan"))
and the output is
['2022-May', '2022-Jun', '2022-Jul', '2022-Aug', '2022-Sep', '2022-Oct', '2022-Nov', '2022-Dec', '2023-Jan', '2023-Feb', '2023-Mar', '2023-Apr', '2023-May', '2023-Jun', '2023-Jul', '2023-Aug', '2023-Sep', '2023-Oct', '2023-Nov', '2023-Dec', '2024-Jan', '2024-Feb', '2024-Mar', '2024-Apr', '2024-May', '2024-Jun']
3
2
I have a dataframe (df) with a column in datetime format YYYY-MM-DD ('date'). I am trying to create a new column that returns the policy year, which always starts on April 1st, so the policy year for January through March is always the prior calendar year. Some of the dates are rather old, so setting up individual date ranges for each year wouldn't be ideal.
The dataframe would look like this
df['date']
2020-12-10
2021-02-10
2019-03-31
and output should look like this
2020
2020
2018
I now know how to get the year using df['date'].dt.year. However, I am having trouble converting each year to the respective policy year, so that if df['date'].dt.month >= 4 the result is df['date'].dt.year, and otherwise df['date'].dt.year - 1.
I am not quite sure how to set this up. I have been trying to avoid creating a boolean column for month >= 4 and then further helper columns. I've gone as far as the function below, but it raises a ValueError saying the truth value of a Series is ambiguous:
def PolYear(x):
    y = x.dt.month
    if y >= 4:
        x.dt.year
    else:
        x.dt.year - 1
df['Pol_Year'] = PolYear(df['date'])
I wasn't sure if this was the right way to go about it, so I also tried a df.loc approach for >= 4 and < 4, but got an error that the lengths of the keys and values are not equal. I definitely think I'm missing something super simple.
I previously had mentioned 'fiscal year', but this is incorrect.
Quang Hoand had the right idea but used the incorrect frequency in the call to to_period(self, freq). For your purposes you want to use the following code:
df.date.dt.to_period('Q-MAR').dt.qyear
This will give you:
0 2021
1 2021
2 2019
Name: date, dtype: int64
Q-MAR defines fiscal year end in March
These values are the correct fiscal years (fiscal years use the year in which they end, not the one in which they begin). If you want the output to use the year in which they begin, it's simple:
df.date.dt.to_period('Q-MAR').dt.qyear - 1
Giving you
0 2020
1 2020
2 2018
Name: date, dtype: int64
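The if/else rule described in the question can also be written directly as column arithmetic, which avoids the ambiguous-truth-value error because no Series is ever passed to a Python if. This is a minimal sketch on the question's three sample dates; it is an alternative to the to_period approach, not a replacement for it.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2020-12-10", "2021-02-10", "2019-03-31"])})

# (month < 4) is a boolean Series; as an integer it is 1 for January-March
# and 0 otherwise, so subtracting it moves those rows to the prior year.
df["Pol_Year"] = df["date"].dt.year - (df["date"].dt.month < 4).astype(int)

# Cross-check against the quarterly-period approach from the answer
via_period = df["date"].dt.to_period("Q-MAR").dt.qyear - 1
print(df)
```

Both expressions give the same policy years for these dates.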
qyear docs
This is qyear:
df.date.dt.to_period('Q').dt.qyear
Output:
0 2020
1 2021
2 2019
Name: date, dtype: int64
I have a dataframe df1:
Month
1
3
March
April
2
4
5
I have another dataframe df2:
Month Name
1 January
2 February
3 March
4 April
5 May
If I want to replace the integer values of df1 with the corresponding name from df2, what kind of lookup function can I use?
I want to end up with this as my df1:
Month
January
March
March
April
February
May
Use replace with a dictionary built from df2:
df1.replace(dict(zip(df2.Month.astype(str),df2.Name)))
Out[76]:
Month
0 January
1 March
2 March
3 April
4 February
5 April
6 May
You can use pd.Series.map and then fillna. Just be careful to map either strings to strings or, as here, numeric to numeric:
month_name = df2.set_index('Month')['Name']
df1['Month'] = pd.to_numeric(df1['Month'], errors='coerce').map(month_name)\
.fillna(df1['Month'])
print(df1)
Month
0 January
1 March
2 March
3 April
4 February
5 April
6 May
You can also use pd.Series.replace, but this is often inefficient.
One alternative is to use map with a function:
def repl(x, lookup=dict(zip(df2.Month.astype(str), df2.Name))):
    return lookup.get(x, x)
df['Month'] = df['Month'].map(repl)
print(df)
Output
Month
0 January
1 February
2 March
3 April
4 May
Use map with a series, just need to make sure your dtypes match:
mapper = df2.set_index(df2['Month'].astype(str))['Name']
df1['Month'].map(mapper).fillna(df1['Month'])
Output:
0 January
1 March
2 March
3 April
4 February
5 April
6 May
Name: Month, dtype: object
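The dtype warning is easy to see on the question's data. The sketch below assumes df1's Month column holds strings (it mixes digits and month names) and df2's Month column holds integers. Neither dtype is stated in the question.

```python
import pandas as pd

# Assumed dtypes: strings in df1, integers in df2
df1 = pd.DataFrame({"Month": ["1", "3", "March", "April", "2", "4", "5"]})
df2 = pd.DataFrame({"Month": [1, 2, 3, 4, 5],
                    "Name": ["January", "February", "March", "April", "May"]})

# Integer keys never match the string values, so every lookup misses
int_keyed = df2.set_index("Month")["Name"]
mismatched = df1["Month"].map(int_keyed)

# String keys match the digit strings; fillna keeps the existing month names
str_keyed = df2.set_index(df2["Month"].astype(str))["Name"]
matched = df1["Month"].map(str_keyed).fillna(df1["Month"])

print(mismatched.isna().all())
print(matched.tolist())
```

When the keys and values have different dtypes, map does not raise an error. It silently returns NaN for every row, and fillna then restores the original values, which can make it look as if nothing happened.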